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Preface 



The 1st Computational Methods for SNPs and Haplotype Inferenee Workshop was 
held on November 21-22, 2002 at the DIMACS Center for Diserete Mathematies and 
Theoretieal Computer Seienee. 

The workshop foeused on methods for SNP and haplotype analysis and their 
applieations to disease assoeiations. The ability to seore large numbers of DNA 
variants (SNPs) in large samples of humans is rapidly aeeelerating, as is the demand 
to apply these data to tests of assoeiation with diseased states. The problem suffers 
from exeessive dimensionality, so any means of redueing the number of dimensions 
of the spaee of genotype elasses in a biologieally meaningful way would likely be of 
benefit. Linked SNPs are often statistieally assoeiated with one another (in "linkage 
disequilibrium"), and the number of distinet eonfigurations of multiple tightly linked 
SNPs in a sample is often far lower than one would expeet from independent 
sampling. These joint eonfigurations, or haplotypes, might be a more biologieally 
meaningful unit, sinee they represent sets of SNPs that eo-oeeur in a population. 
Reeently there has been mueh exeitement over the idea that sueh haplotypes oeeur as 
bloeks aeross the genome, as these bloeks suggest that fewer distinet SNPs need to be 
seored to eapture the information about genotype identity. There is need for formal 
analysis of this dimension reduetion problem, for formal treatment of the hierarehieal 
strueture of haplotypes, and for eonsideration of the utility of these approaehes toward 
meeting the end goal of finding genetie variants assoeiated with eomplex diseases. 

The workshop featured the following invited speakers: 

Peter Donnelly (Oxford University), Kathryn Roeder (Carnegie Mellon University), 
Jonathan Pritehard (University of Chieago), Molly Przeworski (Max Planek Institute), 
Maoxia Zheng (University of Chieago), Elizabeth Thompson (University of 
Washington), Monty Slatkin (University of California, Berkeley), Dahlia Nielsen 
(North Carolina State University), Matthew Stephens (University of Washington), 
Andrew Clark (Cornell University and Celera/ Applied Biosystems), Sorin Istrail 
(Celera/Applied Biosystems), David Cutler (Johns Hopkins University), Magnus 
Nordborg (University of Southern California), Braee Rannala (University of Alberta), 
Russell Sehwartz (Carnegie Mellon University), Fengzhu Sun (University of Southern 
California), Jun Liu (Stanford University), Jinghui Zhang (National Caneer Institute, 
NIH), Dan Gusfield (University of California, Davis), Eran Halperin (University of 
California, Berkeley), Vineet Bafna (The Center for the Advaneement of Genomies), 
David Altsehuler (Harvard Medieal Sehool), Naney Cox (University of Chieago), 
Franeiseo de la Vega (Applied Biosystems), Li Jin (University of Cineinnati) and 
Steve Sherry (National Center for Bioteehnology Information, NIH). 

This volume ineludes papers on whieh the presentations of Andrew Clark, Dan 
Gusfield, Sorin Istrail, Tianhua Niu, Jonathan Pritehard, Russell Sehwartz, Elizabeth 
Thompson, Bmee Rannala, Fengzhu Sun, and Maoxia Zheng were based. It also 
ineludes the eolleetion of abstraets of all the presentations. 




VI 



Preface 



We would like to thank Merissa Henry for outstanding workshop organization and 
editorial support. We would also like to thank the DIMACS Center for Diserete 
Mathematies and Theoretieal Computer Seienee for providing fmaneial support and 
exeellent organization of the workshop. 

January 2004 Andrew G. Clark 

Sorin Istrail 
Miehael Waterman 

Workshop Organizers 
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Abstract. Variability in health risks among individuals with Down syndrome 
suggests the possibility of underlying polymorphism on chromosome 21 that is 
responsible for the variation in risk. Inference of association between risk and 
SNPs can be improved with haplotype information, motivating the problem of 
solving haplotype phases in the case of trisomy. Any gene on chromosome 21 
in a Down patient will have the haplotypes of the three copies of chromosome 
21 confounded. A variety of methods including EM and Bayesian analysis 
have provided useful solutions for haplotype phasing of disomic genotypes. 
Here we provide an extension to the Bayesian approach for inferring linkage 
phase in trisomic samples. 



1 Introduction 

The most common genetic disorder in humans is Down syndrome, or trisomy 21. 
This disorder is caused by failure of chromosome 21 to properly disjoin in formation 
of either the male or (more often) female gametes. Down syndrome is characterized 
by a wide range of severity of manifested problems, from mental retardation to life 
threatening heart defects. It is of considerable medical importance to understand why 
there is such variation in manifestation of Down syndrome defects, not only for 
diagnosis and treatment of Down patients, but because they may shed light on related 
conditions in non-Down patients. Trisomy 21 poses a number of intriguing problems 
for the human geneticist. Most trisomies are fatal, yet trisomy 21 has effects that are 
comparatively mild. Translocations between chromosome 21 and other autosomes, 
most commonly chromosome 9, have resulted in individuals that have three copies of 
only part of chromosome 21 (for review, see Shapiro 1997). Some of these 
individuals are lacking most of the symptoms of Down syndrome. By examining 
many such translocations, it has been possible to identify a region of chromosome 21 
that seems to be most responsible for the characterisitic features of Down syndrome. 

One would like to understand which particular genes on chromosome 21, when 
present in 3 copies, cause defects in brain or heart development. One promising 
approach is an application of linkage disequilibrium mapping, testing direct 
association between SNP genotypes and the severity of the disease. With only about 
300 genes on chromosome 21 (Gitton et al. 2002), and perhaps 50 in the Down 
Syndrome Critical Region (Shapiro 1997), such an association mapping is likely to 
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proceed by identification of putative candidate genes and directly testing association 
between SNPs in each candidate gene and the particular disorder. Although it is 
possible to perform association tests to the raw or unphased SNP data, for reasons 
given later, it is likely to be more powerful to first infer the linkage phase of the 
SNPs, and to then apply the tests of association on these inferred haplotypes. 



2 The Problem 

If a particular nucleotide on chromosome 21 is polymorphic in the population, with 
alternative alleles A and G, then a Down patient could have genotype A/ A/A, A/A/G, 
A/G/G, or G/G/G. In order to carry out the association test described above in Down 
patients, one needs to be able to distinguish between the two heterozygous 
possibilities instead of simply calling the genotype A/G. This can be done with 
pyrosequencing (Ahmadian et al. 2000), a SNP detection method whose signal is 
quantitative, so that one learns whether, in this example, there were 2 As or 2 Gs. The 
genotype information available for analysis here consists of information on the three 
alleles that each individual has at each of 5 distinct SNPs. The data look something 
like A/A/G, T/C/C, G/C/C, T/A/A, G/G/G, for each individual. As in the case with 
diploid data, it is often desirable to infer the linkage phase of these three haplotypes. 
This one individual has three copies of chromosome 21, so there may be 3 distinct 
haplotypes. Some of the admissible haplotypes in this example include {ATGTG, 
ACCAG, GCCAG, GTGTG, ACGTG, GTCTG}. In the case of diploidy, if there are 
n heterozygous SNPs, there are 2" possible haplotypes. In this trisomic case, there are 
still only two distinct nucleotides at each SNP, so the number of admissible 
haplotypes is still 2". But the number of possible genotypes in the diploid case is 2" ', 
and the number of possible genotypes in the trisomic case is 3"'* because individuals 
can have either 2 or 3 distinct haplotypes. This multiplicity of admissible genotypes 
raises challenges to haplotype phase inference. 

The Down condition arises from a nondisjunction at either the first meiotic division 
or the second meiotic division, and these two cases have very different consequences 
for linkage phase inference. In the second meiotic division, sister chromatids 
normally disjoin, so if there is a failure of disjunction, one gamete ends up with two 
copies of the same sister chromatid. Sister chromatids are identical copies of the 
same chromosome, so if this gamete should be fertilized by another normal gamete, 
the trisomic individual that results would have two identical copies of one haplotype 
from one parent, and a second distinct haplotype from the other parent. 
Nondisjunction in meiosis I on the other hand results in a gamete having one copy of 
each of the parent's maternal and paternal chromosome 21. This gamete, when 
fertilized, produces a trisomic individual now having three distinct haplotypes (unless 
by chance two happen to be identical). By scoring microsatellites and other markers, 
it has been possible to infer which parent and in which division meiotic 
nondisjunction has occurred, and these studies have shown that about 77% of the 
cases of Down syndrome are caused by nondisjunction in meiosis I and the remaining 
23% are caused by a nondisjunction in meiosis II. 
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Figure 1. (left) Nondisjunction in the first meiotic division, showing how the disomic 
gamete has a copy of both a maternal and paternal chromatid. When this gamete is 
fertilized, potentially three distinct haplotypes are present in one individual, (right) 
Nondisjunction in the second meiotic division, showing how the disomic gamete has 
two identical haplotypes. In the case of meiosis II nondisjunction, inference of 
haplotype phase is trivial. 



3 Statistical Inference of Linkage Phase 

Phase inference in trisomic individuals is not identical to the diploid case in part 
because of the multiplicity of admissible unphased genotypes, but also because of the 
different consequences of nondisjunction in meiosis I vs meiosis II. Fortunately this 
complication is easily handled by a Monte Carlo markov chain, simply by allowing 
each individual to be in one of two different states (meiosis I or meiosis II 
nondisjunction), defining priors as the probabilities of being in each of these two 
states, and stepping rules for changing an individual from one state to another. 
Because these two states have a strong impact on the probability of genotypes for 
each individual, it is likely that the data will provide good discrimination between an 
individual's having a meiosis I or meiosis II nondisjunction. 

In the case of nondisjunction in the second meiotic division, the Down patient 
receives two identical copies of chromosome 21 (they are sister chromatids) from the 
parent that had the nondisjunction. In this case, phase inference is trivial, because any 
SNP that has two copies of one nucleotide and one copy of the other, must have two 
distinct haplotypes, one with all the singleton alleles, and the other haplotype has all 
the nucleotides present in two doses. 

The essence of the algorithm is to apply Bayes rule, where there is a prior 
probability of nondisjunction in meiosis I of 0.77 (leaving a prior of 0.23 for meiosis 
II). Define vector with value 1 or 2, indicating for individual i in the sample which 
division of meiosis the nondisjunction occurred. 

We next define a markov chain whose stationary distribution is provides the 
posterior probability density for the phased genotypes. The advantage of this 
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approach is that we also learn alternative phasings besides the most probably one, and 
we learn the relative likelihoods of each (Stephens et al. 2001). 

The steps of the Markov chain are: 

1. Initiate the procedure by assuming that all individual had a meiosis II 
nondisjunction. This specifies the complete set of haplotypes. 

2. Calculate the likelihood of the current parameter set based on multinomial 
sampling from a population expected to be in Hardy-Weinberg equilibrium 

3. Propose a swap of one individual’s linkage phase. If the individual is currently 
specified as a meiosis II nondisjunction, then change them to a meiosis I 
nondisjunction, propose a linkage phase, and calculate the likelihood of this state. 
If they are already meiosis I, then draw a random number, and if it is less than the 
prior probability of meiosis II (0.23) then swap back to meiosis II. Otherwise, 
retain the meiosis I state but swap the linkage phase. Calculate the likelihood. 

4. Accept or reject the new step based on the Hastings ratio. 

Steps 2-4 are repeated until convergence. A test of this MCMC algorithm was 
done by presenting to the computer a series of datasets that were generated by 
simulation. We used Hudson's (1999) program to generate haplotypes of specified 
number of SNPs and ancestral recombination rate. To generate test data, these 
haplotypes were drawn at random in sets of three 77% of the time (to represent 
meiosis I lagging individuals) and the remaining 23% of the time, two haplotypes 
were drawn at random, and one was copied to produce an individual resulting from a 
meiosis II nondisjunction. 



4 The Data 

The disorder we will focus on is atrioventricular septal defect, a life-threatening heart 
defect present in about 5% of Down patients that is correctable by surgery. Only 
individuals trisomic for the Down-sensitive region of chromosome 21 get AVSD, so 
both the cases (AVSD-l-) and the controls (AVSD-) are Down patients. There is a 
previously identified congenital heart defect region (CHD critical region — Korenberg 
et al. 1992), within which we have selected a gene, SH3BGR, that is highly expressed 
in the developing heart of the mouse embryo, and likely the human embryo as well 
(Egeo et al. 2000; Reymond et al. 2002). 

The data consist of 5 SNPs in the candidate gene SH3BGR scored in 83 cases of 
atrioventricular canal defect, and 80 controls (who have Down syndrome but not 
AVSD). Genotypes were obtained by pyrosequencing, so that for every heterozygous 
SNP, we know which nucleotide is present in 2 copies and which is present in 1 copy. 
If the SNP has a low frequency of the minor allele, many individuals will have all 
three copies of the same nucleotide, and in these cases there is no ambiguity in 
haplotype phase. As soon as an individual has more than one heterozygous SNP, the 
phasing problem arises. 
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Table 1. The 5-SNP genotypes of the first 21 individuals. All 5 SNPs are 
within the SH3BGR loeus. All subjeets have Down syndrome and exhibit 
three eopies of eaeh SNP. The data were generated by pyrosequeneing, and 
haplotype phase is unknown. 



5 Results 

For both the simulation mns and the SFI3BGR data, the MCMC attained stationarity 
quiekly, and numerous trials from different starting eonditions eonverged to the same 
solution. We obtained an assessment of the aeeuraey of the phase predietions over a 
range of sample sizes, number of SNPs (detemiined by 0 = 4Np) and population 
reeombination rate, p. The latter is important beeause it determines the degree of 
linkage disequilibrium, and phasing is more ehallenging the lower is the linkage 
disequilibrium. For reasonable sample sizes, with values of 0 and p that approximate 
what is found for human eandidate genes, the method provided highly reliable 
haplotype phasing in all but a few eases. As for the diploid ease, aeeuraey is 
partieularly high for eommon haplotypes, and less so for rare haplotypes. 

The SFI3BGR results are presented in Table 2. With a sample of more than 160 
and just 5 segregating SNPs, these data were readily resolved into reliable haplotypes. 
From the inferred eounts of haplotypes, we ean test whether the individuals with the 
septal defeet have a different pattern of haplotypes than the individuals without the 
septal defeet. The ehi-square for Table 2 (eollapsing eells with eounts less than 5) is 
23.1, and with 8 df, this is signifieant at P = 0.003. Looking elosely at the inferred 
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haplotype counts, it is clear that the haplotype TgxCA is highly informative. The 
haplotype TGTCA is much more likely to be associated with a septal defect, whereas 
TGCCA is more likely to be normal. 



Haplotype 


Septal defect 


Normal heart 


TATCC 


67 


45 


TATCA 


16 


8 


CGTTA 


14 


11 


CATCA 


12 


11 


TGTCA 


8 


0 


TGCCA 


8 


18 


TGTTA 


5 


2 


CATCC 


2 


5 


TGTCC 


2 


0 


CGTCC 


1 


4 


TGCCC 


1 


5 


CGTCA 


1 


0 


CGTTC 


1 


4 



Table 2. Counts of the 13 most common haplotypes inferred from the 
genotypes of Down patients that either do or do not have the atrioventricular 
septal defect. 



6 Discussion 

Given the importance of being able to perform association tests in cases of trisomic 
genotypes, there is considerable motivation to obtain efficient and accurate means for 
inference of linkage phase from trisomic genotype data. An important caveat is that 
genetic differences responsible for among-individual variability in severity of Down 
symptoms need not all reside on chromosome 21, so that there is still motivation to 
look beyond chromosome 21 for SNPs that are responsible for the differences. But as 
a starting point, given the role of gene dosage in many disorders, it is sensible to 
begin looking on the chromosome whose dosage is aberrant. 

The Bayesian approach presented here depends on there being Hardy -Weinberg 
proportions of haplotypes in the population, and obtains a phasing solution that 
provides the best fit to this Hardy- Weinberg state. This implies that the method 
works best for sample sizes large enough to obtain homozygotes of most of the 
haplotypes. What this means in absolute numbers depends on the numbers of SNPs 
and their relative frequencies, which in turn impacts the numbers of true haplotypes in 
the population and their relative frequencies. As in the example presented here, much 
of the motivation for inference of haplotype phase arises from the case of testing 
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association in candidate genes. The number of SNPs within a candidate gene region 
can exceed 100 (see the Seattle SNPs project, at pga.gs.washington.edu), however due 
to linkage disequibrium, it is almost never necessary to genotype this many SNPs for 
an association test at this scale. In sum, the method presented here scales well for a 
reasonable size study of perhaps 200 to 500 cases and like number of controls 
geno typed in 5 to 20 SNPs. 

We also note that it is possible to perform an association test based on raw 
genotype data rather than on data for which the linkage phases have been inferred. In 
the case of diploid data, these phase-free methods of association testing seem to 
perform quite well (Liu et al. 2003). Given the added multiplicity of potential phased 
genotypes given unphased trisomic SNP data, it seems likely that initial phase 
inference will provide a greater advantage in the trisomic case. This is an issue that 
merits further investigation, most likely by simulation. 
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Abstract. A cnrrent high-priority phase of human genomics involves 
the development of a full Haplotype Map of the human genome [23]. It 
will be used in large-scale screens of populations to associate specific hap- 
lotypes with specific complex genetic-influenced diseases. A key, perhaps 
bottleneck, problem is to computationally infer haplotype pairs from 
genotype data. This paper follows the talk given at the DIMACS Con- 
ference on SNPs and Haplotypes held in November of 2002. It reviews 
several combinatorial approaches to the haplotype inference problem that 
we have investigated over the last several years. In addition, it updates 
some of the work presented earlier, and discusses the current state of our 
work. 



1 Introduction to SNP’s, Genotypes, and Haplotypes 

In diploid organisms (such as humans) there are two (not completely identical) 
“copies” of each chromosome, and hence of each region of interest. A description 
of the data from a single copy is called a haplotype, while a description of the 
conflated (mixed) data on the two copies is called a genotype. In complex diseases 
(those affected by more than a single gene) it is often much more informative to 
have haplotype data (identifying a set of gene alleles inherited together) than to 
have only genotype data. 

The underlying data that forms a haplotype is either the full DNA sequence 
in the region, or more commonly the values of single nucleotide polymorphisms 
(SNP’s) in that region. A SNP is a single nucleotide site where exactly two (of 
four) different nucleotides occur in a large percentage of the population. The 
SNP-based approach is the dominant one, and high density SNP maps have 
been constructed across the human genome with a density of about one SNP 
per thousand nucleotides. 

1.1 The Biological Problem 

In general, it is not feasible to examine the two copies of a chromosome sep- 
arately, and genotype data rather than haplotype data will be obtained, even 
though it is the haplotype data that will be of greatest use. 

* Research Supported by NSF grants DBI-9723346 and EIA-0220154 
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Data from m sites (SNP’s) in n individuals is collected, where each site can 
have one of two states (alleles), which we denote by 0 and 1. For each individ- 
ual, we would ideally like to describe the states of the m sites on each of the 
two chromosome copies separately, i.e., the haplotype. However, experimentally 
determining the haplotype pair is technically difficult or expensive. Instead, the 
screen will learn the 2m states (the genotype) possessed by the individual, with- 
out learning the two desired haplotypes for that individual. One then uses com- 
putation to extract haplotype information from the given genotype information. 
Several methods have been explored and some are intensively used for this task 
[6,7,12,34,17,31,28,29]. None of these methods are presently fully satisfactory, 
although many give impressively accurate results. 

1.2 The Computational Problem 

Abstractly, input to the haplotyping problem consists of n genotype vectors, each 
of length TO, where each value in the vector is either 0,1, or 2. Each position in 
a vector is associated with a site of interest on the chromosome. The position in 
the genotype vector has a value of 0 or 1 if the associated chromosome site has 
that state on both copies (it is a homozygous site), and has a value of 2 otherwise 
(the chromosome site is heterozygous) . 

Given an input set of n genotype vectors, a solution to the Haplotype In- 
ference (HI) Problem is a set of n pairs of binary vectors, one pair for each 
genotype vector. For any genotype vector g, the associated binary vectors vi,V 2 
must both have value 0 (or 1) at any position where g has value 0 (or 1); but 
for any position where g has value 2, exactly one of vi,V 2 must have value 0, 
while the other has value 1. That is, vi,V 2 must be a feasible “explanation” for 
the true (but unknown) haplotype pair that gave rise to the observed genotype 
g. Hence, for an individual with h heterozygous sites there are haplotype 
pairs that could appear in a solution to the HI problem. 

For example, if the observed genotype g is 0212, then the pair of vectors 0110, 
0011 is one feasible explanation, out of two feasible explanations. Of course, we 
want to find the explanation that actually gave rise to g, and a solution for the 
HI problem for the genotype data of all the n individuals. However, without 
additional biological insight, one cannot know which of the exponential number 
of solutions is the “correct one” . 

1.3 The Need for a Genetic Model 

Algorithm-based haplotype inference would be impossible without the implicit 
or explicit use of some genetic model, either to asses the biological fidelity of any 
proposed solution, or to guide the algorithm in constructing a solution. Most of 
the models use statistical or probabilistic aspects of population genetics. We will 
take a more deterministic or combinatorial approach. 

In this paper we will review several combinatorial investigations into the 
haplotype inference problem, and in each case, discuss the implicit or explicit 
genetic model that is involved. 
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2 Optimizing Clark’s Method 

A. Clark, in [6] was the first to propose an algorithm to solve the haplotype 
inference problem. It has been widely used, and is still in use today. We will 
explain the method is a somewhat more abstract setting. 

Abstractly, input consists of n vectors, each of length m, where each value 
in the vector is either 0,1, or 2. Each position in a vector is associated with a 
site of interest on the chromosome. The state of any site on the chromosome is 
either 0 and 1. The associated position in the vector has a value of 0 or 1 if the 
chromosome site has that state on both copies (it is a homozygous site), and it 
has a value of 2 if both states are present (it is a hetrozygous site) . A position 
is considered “resolved” if it contains 0 or 1, and “ambiguous” if it contains 
a 2. A vector with no ambiguous positions is called “resolved”, and otherwise 
called “ambiguous”, haplotype pair. Given two non-identical resolved vectors R 
and NR, the conflation of R and NR produces the ambiguous genotype vector 
A, with entry 0 (respectively 1) at each site where both R and NR have 0 
(respectively 1) entries, and with entry 2 at each site where the entries of R and 
NR differ. 

The method of Clark begins by identifying any vector in the input with zero 
or one ambiguous sites, since in the first case the original two haplotypes are 
identical to the genotype vector, and in the second case the one ambiguous site 
has a forced resolution, producing two forced haplotype vectors. These identified 
haplotypes are called the initial resolved vectors (haplotypes). Clark’s method 
requires that some initial resolved haplotypes are available in this way. 

The main part of Clark’s method resolves remaining ambiguous genotypes 
by expanding out from the initial resolved haplotypes. Clark [6] proposed the 
following rule that infers a new resolved vector NR (or haplotype) from an am- 
biguous vector A and an already resolved vector R. The resolved vector R can 
either be one of the input resolved vectors, or a resolved vector inferred by an 
earlier application of the Inference Rule. 

Inference Rule: Suppose A is an ambiguous vector with h, say, am- 
biguous positions, and i? is a known resolved vector which equals one of 
the 2^ potential resolutions of vector A (where each of the ambiguous 
positions in A is set to either 0 or 1). Then infer that A is the conflation 
of one copy of resolved vector R and another (uniquely determined) re- 
solved vector NR. All the resolved positions of A are set the same in NR, 
and all of the ambiguous positions in A are set in NR to the opposite of 
the entry in R. Once inferred, vector NR is added to the set of known 
resolved vectors, and vector A is removed from the vector set. 

For example, if A is 0212 and R is 0110, then NR is 0011. The interpretation 
is that if the two haplotypes in a screened individual are 0110 and 0011, then the 
observed genotype would be 0212. The inference rule resolves the vector 0212 
using the belief that 0110 is a haplotype in the population, to infer that 0011 is 
also a haplotype in the population. 
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When the Inference Rule can be used to infer the vector NR from the vectors 
A and R, we say that R can be applied to (resolve) A. It is easy to determine 
if a resolved vector R can be applied to resolve an ambiguous vector A: R can 
be applied to A if and only if A contains no resolved position s such that the 
values of A and R differ at position s. A resolved position s in A whose value 
does differ from that in R is said to block the application of R to A. For example, 
0110 can not be applied to 2012 because position two (from the left) blocks the 
application. 

Clark’s entire algorithm for resolving the set of genotypes is to first identify 
the initial resolved set, and then repeatedly apply the Inference Rule until either 
all the genotypes have been resolved, or no further genotypes can be resolved. 

The implicit genetic model (I believe) behind the Inference Rule is that the 
genotypes of individuals in the current population resulted from random mating 
of the parents of the current population. The sampled individuals are also drawn 
randomly from the population, and the sample is small compared to the size of 
the whole population, so the initial resolved vectors likely represent common 
haplotypes that that appear with high frequency in the population. Hence these 
haplotypes are likely to have been used in the creation of currently unresolved 
genotypes. So, if an unresolved genotype A can be explained the by conflation of 
two initial resolved haplotypes, or by using one of the initial resolved haplotypes, 
it is sensible to resolve A in that way, and then to deduce that vector NR is 
also in the population. We can define the “distance” of an inferred haplotype 
NR from the initial resolved vectors, as the number of inferences used on the 
shortest path of inferences from some initial resolved vector, to vector NR. The 
above explanation for the correctness of an Inference Rule becomes becomes 
weaker as it is used to infer vectors with increasing distance from the initial 
resolved vectors. However, Clark’s Inference Rule is “globally” justified in [6] by 
an additional empirical observation that will be discussed shortly. 

Note that in the application of the Inference Rule, there may be choices 
for vectors A and R, and since A is removed once it is resolved, a choice that is 
made at one point can constrain future choices. Hence, one series of choices might 
resolve all the ambiguous vectors in one way, while another execution, making 
different choices, might resolve the vectors in a different way, or leave orphans, 
ambiguous vectors that cannot be resolved. For example, consider a problem 
instance consisting of two resolved vectors 0000 and 1000, and two ambiguous 
vectors 2200 and 1122. Vector 2200 can be resolved by applying 0000, creating 
the new resolved vector 1100 which can then be applied to resolve 1122. That 
execution resolves both of the ambiguous vectors and ends with the resolved 
vector set 0000, 1000, 1100 and 1111. But 2200 can also be resolved by applying 
1000, creating 0100. At that point, none of the three resolved vectors, 0000, 1000 
or 0100 can be applied to resolve the orphan vector 1122. 

The problem of choices is addressed in [6] by using an implementation of 
the method where the choices are affected by the ordering of the data. For any 
input, the data is reordered several times, the method is rerun for each ordering, 
and the “best” solution over those executions is reported. Of course, only a tiny 
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fraction of all the possible data orderings can be tried. We will refer to this as 
the local inference method. 

Without additional biological insight, one cannot know which execution (or 
data ordering) gives the solution that is nearest to the truth. However, simu- 
lations discussed in [6] show that the inference method tends to produce the 
wrong vectors only when the execution also ends with ambiguous vectors that 
cannot be resolved. That is, there is some “global” structure to the set of true 
haplotypes that underly the observed genotypes, so that if some early choices 
in the method incorrectly resolve some of the genotypes, then the method will 
later become stuck, unable to resolve the remaining genotypes. The exact na- 
ture of this global structure was not made explicit, but simulations reported in 
[6] confirmed this expectation. Executions of the method that resolved all the 
ambiguous genotypes, more accurately found the original haplotypes than did 
executions that got stuck. Clark, therefore recommended that his method should 
be run numerous times, randomly reordering the input data each time, and then 
the execution that resolved the most genotypes should be the one most trusted. 



2.1 The MR Problem 

Given what was observed and proposed in [6] , the major open algorithmic ques- 
tion from [6] is whether efficient rules exist to break choices in the execution of 
Clark’s algorithm, so as to maximize the number of genotypes it resolves. This 
leads to the problem studied in [16,17] 

Maximum Resolution (MR) : Given a set of vectors (some ambiguous 
and some resolved), what is the maximum number of ambiguous vectors 
that can be resolved by successive application of Clark’s Inference Rule? 

Stated differently, given a set of vectors, what execution of the inference method 
maximizes the number of ambiguous vectors that are resolved? We want to 
answer this question, rather than rely on re-running the method many times, 
sampling only a miniscule fraction of the possible executions. An algorithm to 
solve the MR problem needs to take a more global view of the data, than does 
the more local inference method, to see how each possible application of the 
Inference Rule influences choices later on. 

Unfortunately, in [17], we show that the MR problem is NP-hard, and in 
fact, Max-SNP complete. The reduction also shows that two variants of the MR 
problem are NP-hard. We next reformulated the MR problem as a problem on 
directed graphs, with an exponential time (worst case) reduction. We will detail 
this approach below. That graph-theoretic problem can be solved via integer 
linear-programming. Experiments with this approach suggest that the reduction 
is very efficient in practice, and that linear programming alone (without explicit 
reference to integrality) often suffices to solve the maximum resolution problem. 
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2.2 A Graph-Theoretic View of the MR Problem 

Given the NP-hardness and Max-SNP completeness of the MR problem, we 
would like a “heuristic” method that feels likely to perform well in practice. 
That algorithm might not always find the optimal solution to the MR problem, 
but it should correctly know when it has actually found the optimal and when 
it has not. To do this, we need an algorithm that takes a more global view of 
the data before deciding on where to apply the Inference Rule. One approach is 
to translate the MR problem (via a worst-case exponential-time reduction) to a 
graph problem as follows. 

We create a directed graph G containing a set of nodes N{A), for each am- 
biguous vector A in the input, and a set of nodes, 7, containing one node for 
each resolved vector in the input. In detail, for each ambiguous vector A, with 
say h ambiguous positions, R{A) is the set of the 2^ distinct, resolved vectors 
created by setting the ambiguous positions in A (to zero or one) in all possible 
ways. N{A) is a set of 2^ nodes, each labeled by a distinct vector in R{A). Note 
that two nodes (in different N{) sets) can have the same label, but the nodes 
are distinguished by the ambiguous vector they originate from. Then connect 
a directed edge from any node v to any node v' in G if and only v' is in a set 
R{A) for some ambiguous vector A (i.e., v' is not in I), and the application of 
resolved vector labeling v to the ambiguous vector A would create the resolved 
vector labeling v\ 

The application of a resolved vector to an ambiguous vector (if possible) 
uniquely determines the inferred resolved vector. Hence, for any ambiguous vec- 
tor A and any node v in G, there is at most one edge from v to the set of nodes 
N{A). Therefore, any directed tree in G rooted a node v G I specifies a feasible 
history of successive applications of the Inference Rule, starting from node v G I. 
The non-root nodes reached in this tree specify the resolved vectors that would 
be created from ambiguous vectors by this succession of applications. Therefore, 
the MR problem can be recast as the following problem on G. 

The Maximum Resolution on G (MRG) Problem Find the largest 
number of nodes in G that can be reached by a set of node-disjoint, 
directed trees, where each tree is rooted at a node in 7, and where for 
every ambiguous vector A, at most one node in N(A) is reached. 

Despite the exponential worst-case blowup, this formulation of the MR prob- 
lem is appealing because the construction of G in practice is likely to be efficient, 
and because without the last constraint, the MRG problem is trivial. The con- 
struction is efficient because it can be implemented to avoid enumerating isolated 
nodes of G, and the expected number of ambiguous positions in any vector is 
generally small. Let G denote graph derived from G where every node that can’t 
be reached from 7 has been removed, along with any incident edges. The im- 
plementation given in [17] constructs G in time proportional to mn^ plus the 
number of edges in G, which in practice is a very small fraction of the number 
of edges in G. 
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2.3 An Exact Integer Programming Formulation for the MRG 
Problem 

The MRG problem can first be formulated by adding simple non-linear con- 
straints to a network flow formulation. First, let all the edges in G be given a 
huge capacity (larger than the number of genotype in the input). Then, add a 
source node and sink node to G; direct an edge of infinite capacity from the 
source node to each node in I ; direct an edge with capacity one from each node 
not in I to the sink node. Clearly, a feasible solution to the MRG problem that 
reaches q nodes (and hence resolves q ambiguous vectors) defines a source-sink 
flow of value q exactly. However, the converse does not yet hold, since we have 
not excluded the possibility of reaching more than one node in any set N(A). 
For that, consider a linear programming formulation (for background see [27,32]) 
of the above network flow problem, and let Xe denote the variable for the flow 
on edge e. Then for every pair of edges e, e' that enter nodes in some set N{A), 
add in the non-linear constraint x^Xe' = 0 to the network flow formulation. An 
integer solution to this mathematical program exactly solves the MRG problem. 
But since this formulation involves non-linear constraints, it is not clear how we 
would solve it in practice. 

In [17], we explored a different integer programming formulation, that is not 
exact, but that found the exact solution most of the time. However, recently 
R. Ravi [33] suggested how to reformulate the non-linear constraints as linear, 
integer constraints, which results in making the above formulation an exact 
integer, linear-programming formulation for the MRG problem. 

Let Ce and Cg' denote the capacities on edges e and e' respectively. Ravi 
suggested replacing the constraint XeXe' = 0, with Xe < de, Xe' < Cgde' and 
de + Ce'de' < 1, where de and de> are 0,1 variables. Since Cg and Cg' are constants, 
these three inequalities are linear, and their effect is to limit the flow to at most 
one of the edges e or e'. 



2.4 Results 

The maximum-resolution hypothesis in [6] is that the most accurate solutions 
tend to come from the executions of the inference method that resolve the most 
ambiguous vectors. In the simulations done in [17] we verify the basic maximum- 
resolution hypothesis. We consistently observed that the executions with the 
most correct resolutions were the ones with the most resolutions. More informa- 
tively, the ratio of correct resolutions to total resolutions increases as the total 
number of resolutions increases, and the distribution of the number of correct 
resolutions tends to be more skewed to the right, as the number of resolutions 
increases. These simulations are consistent with the basic maximum-resolution 
hypothesis. However, maximum-resolution alone is not a sufficient guide to find- 
ing the largest number of correct resolutions, because the distribution of the 
number of correct resolutions have high variance, even among those executions 
that resolve the maximum number of vectors. 
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So, in order to maximize the number of correct resolutions, it is indeed best 
to select from those executions that maximize the total number of resolutions 
(as the maximum-resolution hypothesis states), but in order to decide which 
of those execution(s) to use, one still needs some additional criteria. The most 
effective secondary criteria to use, in order to find solutions with a large number 
of correct resolutions, is to minimize the number of distinct haplotypes used 
in the solution. That is, if we first restrict attention to those executions that 
maximize the number of resolutions, and then within that group of executions, 
restrict attention to the ones that use the smallest number of distinct haplotypes, 
the quality of the solution greatly improves. This will be further discussed in the 
next section. 



3 Supercharging Clark’s Method 

The observations reported above suggest that maximum resolution is not enough 
in order to find the most accurate solution when using Clark’s method. Certainly, 
Clark’s method should be run many times, as Clark suggested, but it is not clear 
how to use the full set of results obtained from these executions. (Interestingly, 
most published evaluations of Clark’s method only run it once, ignoring the 
intuition that Clark had, that the stochastic behavior of his algorithm was an 
important factor.) 

In [30], we examine several variants of Clark’s method, using a set of 80 
genotypes, of which 47 were ambiguous; the others were either homozygous at 
each site or only had a single hetrozygous site. Each genotype contained nine SNP 
sites in the human APOE gene. Independently, the correct underlying haplotypes 
were laboratory-determined in order to calibrate the accuracy of each variation 
of Clarks’s method (accuracy measured by how many of the haplotype pairs 
reported by the algorithm were actually the original haplotype pairs for that 
genotype) . We observed that most variations of Clark’s method produce a large 
number of different solutions, each of which resolved all of the 47 ambiguous 
genotypes. The solutions had a high accuracy variance, and choosing a solution 
at random from among the 10,000 solutions would give a poor solution with high 
probability. Hence, an important issue in using Clark’s method is how to make 
sense, or exploit, the many different solutions that it can produce, and that each 
resolve all of the ambiguous genotypes. 

The main result is that the following strategy works to greatly improve the 
accuracy of Clark’s method. First, for the input genotype data, run Clark’s 
method many times (we used 10,000 times), each time randomizing decisions that 
the method makes. Second, over all the runs, select those runs which produce a 
solution using the fewest or close to the fewest number of distinct haplotypes. 
The number of such runs was typically in the tens. In those tens of runs, for 
each genotype g in the input, record the most used haplotype pair that was 
produced to explain g. The set of such explaining haplotype pairs is called the 
“consensus solution” . We observed that the consensus solution had dramatically 
higher accuracy than the average accuracy of the 10,000 solutions. For example. 
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in the APOE data, in one of the variants, out of the 10,000 executions, there were 
24 executions that used 20 or 21 distinct haplotypes, and twenty was the smallest 
observed number. The average accuracy of the 10,000 executions was 29 correct 
haplotype pairs out of the 47 ambiguous genotypes, and the execution with the 
highest accuracy in the 10,000 had 39 correct pairs. However, the average of the 
24 selected executions was 36. The consensus solution of those 24 executions had 
39 correct pairs. Hence, this simple rule allowed us to home in on a single solution 
that was as good as the most accurate solution over all the 10,000 solutions. In 
another variant of Clark’s method, the consensus solution had 42 correct pairs, 
while the average of all the 10,000 solutions had 19 correct. For comparison, the 
program PHASE always got 42 correct solutions, and the program Haplotyper 
produced a range of solutions with most getting either 43 or 42 correct, with one 
solution getting 44 correct, and three solutions getting only 37 correct. 

We also observed that among the tens of solutions that use few haplotypes, 
any haplotype pair that is used with high frequency, say above 85% of the time, 
was almost always correct. This allows one to home in on those pairs that can 
be used with high confidence. 

4 Perfect Phylogeny Haplotyping 

As mentioned earlier, the haplotype inference problem would be impossible with- 
out some implicit or explicit genetic model guiding the method or selecting 
the most promising solutions. The most powerful such genetic model is the 
population-genetic concept of a coalescent, i.e., a rooted tree that describes the 
evolutionary history of a set of haplotypes in sampled individuals [35,25]. The 
key observation is that “In the absence of recombination, each sequence has a 
single ancestor in the previous generation.” [25]. 

That is, if we follow backwards in time the history of a single haplotype H 
from a given individual I, when there is no recombination, that haplotype H is 
a copy of one of the haplotypes in one of the parents of individual I. It doesn’t 
matter that I had two parents, or that each parent had two haplotypes. The 
backwards history of a single haplotype in a single individual is a simple path, if 
there is no recombination. That means that the history of a set of 2n individuals, 
if we look at one haplotype per individual, forms a tree. The histories of two 
sampled haplotypes (looking backwards in time) from two individuals merge at 
the most recent common ancestor of those two individuals. (The reason for using 
2n instead of n will be clarified shortly.) 

There is one additional element of the basic coalescent model: the infinite- 
sites assumption. That is, the m sites in the sequence (SNP sites in our case) 
are so sparse relative to the mutation rate, that in the time frame of interest at 
most one mutation (change of state) will have occurred at any site. 

Hence the coalescent model of haplotype evolution says that without recom- 
bination, the true evolutionary history of 2n haplotypes, one from each of 2n 
individuals, can be displayed as a tree with 2n leaves, and where each of the 
m sites labels exactly one edge of the tree, i.e., at a point in history where a 
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mutation occurred at that site. This is the underlying genetic model that we 
assume from here on. Note that we may not know the ancestral haplotype at the 
root of the tree, although the algorithms can be somewhat simplified when we 
do know it. See [35] for another explanation of the relationship between sequence 
evolution and coalescents. 

In more computer science terminology, the no-recombination and infinite- 
sites model says that the 2n haplotype (binary) sequences can be explained by 
a perfect phytogeny [14,15] which is defined next. 

Definition Let M be an 2n by m 0-1 (binary) matrix. Let V be an m-length 
binary vector, called the ancestor vector. 

A perfect phytogeny for M and 1^ is a rooted tree T with exactly 2n leaves 
that obeys the following properties: 

1) Each of the 2n rows labels exactly one leaf of T, and each leaf is labeled 
by one row. 

2) Each of the m columns labels exactty one edge of T. 

3) Every interior edge (one not touching a leaf) of T is labeled by at teast 
one column. 

4) For any row i, the value is unequal to V{j) if and only if j labels 

an edge on the unique path from the root to the leaf labeled i. Hence, that path 
is a compact representation of row i. 

The biological interpretation is that an edge label j indicates the point in 
time where a mutation at site j occurred, and so the state of site j changes 
from its ancestral value to the opposite value. The motivation for the perfect 
phylogeny model is based on recent observations of little or no recombination in 
long segments of Human DNA, and the standard infinite-sites assumption. 

In the rooted version of perfect phylogeny, V is given as input. There is also 
an unrooted version of perfect phylogeny, where V is not specified. In that case, 
a binary matrix M is said to have a perfect phylogeny if there exists a V such 
that there is a (rooted) perfect phylogeny for M and V. 

Formally, the Perfect Phylogeny Haplotype (PPH) Problem is: Given 
a set of genotypes, M, find a set of explaining haplotypes, M', which defines an 
unrooted perfect phylogeny. 



What Happened to the Genotype Data? 

How does the perfect phylogeny view of haplotypes relate to the problem of 
deducing the haplotypes when only the n genotype vectors are given as input? 

The answer is that each genotype vector (from a single individual in a sample 
of n individuals) was obtained from the mating of two of 2n haplotype vectors 
in an (unknown) coalescent (or perfect phylogeny). That is, the coalescent with 
2n leaves is the history of haplotypes in the parents of the n individuals whose 
genotypes have been collected. Those 2n haplotypes are partitioned into pairs, 
each of which gives rise to one of the n observed genotypes. 

So, given a set S' of n genotype vectors, we want to find a perfect phylogeny 
T, and a pairing of the 2n leaves of T which explains S. In addition to efficiently 
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finding one solution to the PPH problem, we would like to determine if that is 
the unique solution, and if not, we want to efficiently represent the set of all 
solutions, so that each one can be generated efficiently. 

4.1 Algorithm and Program GPPH 

The PPH problem was introduced and first solved in [18]. The algorithm given 
in [18] is based on reducing the PPH problem to a well-studied problem in graph 
theory, called the graph realization problem. The theoretical running time of this 
approach is 0{nma{nm)), where a is the inverse Ackerman function, usually 
taken to be a constant in practice. Hence, the worst case time for the method 
is nearly linear in the size of the input, nm. The reduction in [18] has a small 
error, and a corrected, simplified reduction is detailed at: 
wwwcsif.cs.ucdavis.edu/~gusfield/recomberrata.pdf. 

The time for the reduction is 0{nm), and the graph realization problem can 
be solved by several published methods. The method in [3] is based on a general 
algorithm due to Lofgren, and runs in 0{nma(nm)) time. That algorithm is the 
basis for the worst-case time bound established in [18], but we found it to be 
too complex to implement. In [18] it was explained that after one PPH solution 
is obtained, by whatever method, one can get an implicit representation of the 
set of all PPH solutions in 0{m) time. 

Using a different solution to the graph realization problem [13], this approach 
yields a time bound of 0{nm?), and that approach has been implemented in a 
program now called GPPH [5] . GPPH contains within it a fully general solution 
to the graph realization problem, even for instances that do not arise from any 
PPH problem instance. 

4.2 Algorithm and Program DPPH 

The second method to solve the PPH problem is called the DPPH method. 
The method in DPPH [1,2] is not based (explicitly or implicitly) on a graph 
realization algorithm, but is based on deeper insights into the combinatorial 
structure of the PPH problem and its solution. The running time for the method 
is also 0{nm?), and the algorithm produces a graph that represents the set of 
all solutions in a simple way. That approach has been implemented and is called 
DPPH. 

4.3 Algorithm and Program HPPH 

A third method to solve the PPH problem was developed in [11]. It also has 
worst-case running time of 0{nm?), and it can be used to find and represent 
all the solutions. This approach has been implemented and is called HPPH. Al- 
though developed independently of GPPH, one can view the method in HPPH as 
a specialization of graph realization method used in GPPH to the PPH problem, 
simplifying the general purpose graph realization method to problem instances 
coming from the PPH problem. 
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4.4 Comparing the Execution of the Programs 

All three of the programs GPPH, DPPH and HPPH are available at 
http : / / wwwcsif . cs.ucdavis.edu / ~ gusfield / . 

The three programs have been extensively tested on the same datasets. Before 
the methods were tested against one another, we expected that DPPH would 
be the fastest, and that HPPH would be faster than GPPH. The reason is the 
DPPH exploits the deepest insights into the special combinatorial structure of 
the PPH problem, GPPH is the most general, and HPPH can be viewed as a 
simplification of GPPH obtained by exploiting some structural properties of the 
PPH problem. 

The empirical testing of the programs exactly matched our expectations, and 
details can be found in [4]. The empirical testing also uncovered two interesting 
phenomena that we discuss next. 



Uniqueness of the Solution: a Strong Phase Transition 

For any given input of genotypes, it is possible that there will be more than one 
PPH solution. When designing a population screen and interpreting the results, 
a unique PPH solution is very important. So the question arises: for a given 
number of sites, how many individuals should be in the sample (screen) so that 
the solution is very likely to be unique? This is a question that was raised in [18]. 
Therefore, we did several experiments which determine the frequency of a unique 
PPH solution when the number of sites and genotypes changes. Intuitively, as 
the ratio of genotypes to sites increases, the probability of uniqueness increases. 
We studied precisely this issue, and the striking observation is that there is a 
strong phase transition in the frequency of unique solutions as the number of 
individuals grows. That is, the frequency of unique solutions is close to zero for 
a given number of individuals, and then jumps to over 90% with the addition of 
just a few more individuals. 



Handling Haplotypes Generated by Some Recombinations 

The PPH problem is motivated by the coalescent model without recombina- 
tion. However, the programs can be useful for solving the HI problem when the 
underlying haplotypes were generated by a history involving some amount of 
recombination. In that case, it is not expected that the entire data will have a 
PPH solution, but some intervals in the data might have one. We can use one 
of the PPH programs to find maximal intervals in the input genotype sequences 
which have unique PPH solutions. Starting from position 1, we find the longest 
interval in the genotype data which has a unique PPH solution. We do this using 
binary search, running a PPH program on each interval specified by the binary 
search. Let us say that the first maximal interval extends from position 1 to 
position i. We output that interval, and then move to position 2 to determine if 
there is an interval that extends past i containing a unique PPH solution. If so, 
we find the maximal interval starting at position 2, and output it. Otherwise, 
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we move to position 3, etc. We continue in this way to output a set of maximal 
intervals, each of which contains a unique PPH solution. This also implicitly 
finds, for each starting position, the longest interval starting at that position 
that contains a unique PPH solution. 

In principle, the intervals that are output could overlap in irregular, messy 
ways. However, we have observed that this is rarely the case. Generally, the 
output intervals do not overlap, or two intervals overlap at one site, i.e., the 
right end of one interval may overlap in one position with the left end of the 
next interval. This provides a clean decomposition of the data into a few intervals 
where in each, the data has a unique PPH solution. 

The most striking thing is that when the recombination rate is moderate, 
the accuracy of the PPH solutions inside each interval, compared to the original 
haplotypes, is very high. There are many ways that such a decomposition can 
be used. The most obvious is to reduce the amount of laboratory work that is 
needed to fully determine the correct haplotypes. For example, in a problem 
with 100 sites and 10 intervals, we can form new shorter genotype vectors based 
on one site per interval, hence 10 sites. If the correct haplotype pairs for these 
shorter genotype sequences are determined, we can combine that information 
with the (assumed correct) haplotype pairs determined in each interval by a 
PPH program. The lab effort is reduced to one tenth of what it would be to 
determine the haplotypes from 100 sites. That is a huge reduction in laboratory 
effort. 

5 Pure Parsimony 

One natural approach to the HI problem that is often mentioned in the literature 
is called the Pure- Par simony approach^: Find a solution to the HI problem that 
minimizes the total number of distinct haplotypes used. 

For example, consider the set of genotypes: 02120, 22110, and 20120. There 
are HI solutions for this example that use six distinct haplotypes, but the so- 
lution 00100, OHIO; OHIO, 10110; 00100, 10110, for the three genotype vectors 
respectively, uses only the three distinct haplotypes 00100, OHIO, and 10110. 

The Pure parsimony criteria reflects the fact that in natural populations, 
the number of observed distinct haplotypes is vastly smaller than the number of 
combinatorially possible haplotypes, and this is also expected from population 
genetics theory. We also saw this in the experiments reported in [17]. More- 
over, the parsimony criteria is to some extent involved in existing computational 
methods for haplotype inference. For example, some papers have tried to explain 
Clark’s method [6] in terms of parsimony [29], although the role of parsimony is 
not explicit in the method, and the haplotyping program PHASE [34] has been 
explained in terms of the parsimony criteria [9]. However, the indirect role of 
the parsimony criteria in those methods, and the complex details of the com- 
putations, makes it hard to see explicitly how the parsimony criteria influences 

^ This approach was suggested to us Earl Hubbell, who also proved that the problem 
of finding such solutions is NP-hard [24]. 
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the computation. This makes it difficult to use those methods to evaluate the 
effectiveness of the parsimony criteria as a genetic model. 

In [19] we detail how to compute, via integer-linear-programming, an HI 
solution that minimizes the number of distinct haplotypes, i.e., is guaranteed 
to solve the Pure-Parsimony problem. However, the worst-case running time 
increases exponentially with the problem size, so the empirical issue is whether 
this approach is practical for problem instances of current interest in population- 
scale genomics. The paper shows wasy to improve the practicality of the basic 
integer programming formulation in a way that is very effective in practice. 
Extensive experimentation was done to show the time and memory practicality 
of the method, and to compare its accuracy against existing HI methods that 
do not explicitly follow the Pure-parsimony criteria. 

Empirically, the end result is that for haplotyping problem instances of cur- 
rent interest, Pure-parsimony can be computed efficiently in most cases. How- 
ever, it’s accuracy is somewhat inferior to the solutions produced by the program 
PHASE, although this depends on the number of sites and the level of recombi- 
nation. 

In more detail, the practicality and accuracy of our approach depend on the 
level of recombination in the data (the more recombination, the more practical 
but less accurate is the method). We show here that the Pure-Parsimony ap- 
proach is practical for genotype data of up to 30 sites and 50 individuals (which 
is large enough for practical use in many current haplotyping projects). Up to 
moderate levels of recombination, the haplotype calls are 80 to 95 percent cor- 
rect, and the solutions are generally found in several seconds to minutes, except 
for the no-recombination case with 30 sites, where some solutions require a few 
hours. 

These results are a validation of the genetic model implicit in the Pure- 
Parsimony objective function, for a Purely randomly picked solution to the HI 
problem would correctly resolve only a minuscule fraction of the genotypes. 

Recently the paper [36] gave experimental results on a pure parsimony ap- 
proach to the haplotype inference problem, solving it by branch and bound 
instead of integer programming. Theoretical results on pure parsimony appear 
in [26]. 



6 Adding Recombination into the Model 

The perfect phylogeny haplotyping problem is motivated by the assumption that 
in some interesting cases (haplotype blocks for example) the evolution of the 
underlying haplotypes can be represented on a perfect phylogeny. To increase the 
applicability of the model, we want to relax that stringent assumption. This was 
done in a heuristic way in [10] where they observed that the haplotypes reported 
in [8] cannot generally be derived on a perfect phylogeny, but with the removal 
of only a small number of individuals, the remaining sequences can be derived 
on a perfect phylogeny. Those observations validate the underlying, ideal perfect 
phylogeny model and the utility of having an efficient, clean solution for the 
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PPH problem, but they also highlight a need to introduce more robust models of 
haplotype evolution into the haplotyping model. Probably, the most important 
modification of the perfect phylogeny model would be to allow sequences to 
recombine, as is common in the real evolution of sequences in populations. When 
recombination is allowed, the history of the sequences no longer fits a tree-like 
structure, but rather a network is required to represent the derivation of the 
haplotypes. 

Once recombination is allowed into the underlying model, we can define the 
formal analog of the PPH problem: given a set of genotypes whose underlying 
haplotypes were believed to have evolved on a network with recombination, 
find a set of explaining haplotypes (for the genotypes) which can be derived 
on a phylogenetic network with a small amount of recombination. We call this 
the “Phylogenetic Network Haplotyping (PNH) Problem” . By specifying a small 
amount of recombination, we limit the number of distinct haplotypes that can be 
created on the network. As discussed earlier, the number of haplotypes observed 
in real populations tends to be relatively small. 

The solutions to the PPH problem exploit basic theorems about when a set 
of haplotypes can be derived on a perfect phylogeny. Those theorems are rela- 
tively simple to state and crucial to the PPH solutions. The PPH problem has 
a clean solution partly because of the simplicity of the necessary and sufficient 
condition for a set of haplotypes to be derived on a perfect phylogeny. The PNH 
problem is harder, because we have no analogous, simple theorems about phylo- 
genetic networks. Hence, we have recently focussed attention on the derivation 
of haplotypes on phylogenetic networks, with the ultimate goal of applying what 
we have learned to the PNH problem. We have focussed mostly on phylogenetic 
networks with constrained recombination. Results to date can be found in in 
[ 21 , 20 , 22 ]. 
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Abstract. It is widely anticipated that the study of variation in the 
human genome will provide a means of predicting risk of a variety of 
complex diseases. Single nucleotide polymorphisms (SNPs) are the most 
common form of genomic variation. Haplotypes have been suggested as 
one means for reducing the complexity of studying SNPs. In this paper 
we review some of the computational approaches that have been taking 
for determining haplotypes and suggest new approaches. 



1 Introduction 

Genomes can be considered to be a collection of long strings, or sequences, from 
the alphabet |A,C,G,T}. Each element of the alphabet encodes one of four pos- 
sible nucleotides. With the completion of the sequencing of the human genome, 
efforts are underway to catalogue genomic variations across human populations. 
Single Nucleotide Polymorphisms or SNPs constitute a large class of these vari- 
ations. A SNP is a single base pair position in genomic DNA at which different 
nucleotide variants exist in some populations; each variant is called an allele. In 
human, SNPs are almost always biallelic; that is, there are two variants at the 
SNP site, with the most common variant referred to as the major allele, and the 
less common variant as the minor allele. Each variant must be represented in a 
significant portion of the population to be useful. 

Diploid organisms, such as humans, possess two nearly identical copies of 
each chromosome. In this paper, we will refer to a collection of SNP variants 
on a single chromosome copy as a haplotype. Thus, for a given set of SNPs, an 
individual possesses two haplotypes, one from each chromosome copy. A SNP 
site where both haplotypes have the same variant (nucleotide) is called a ho- 
mozygous site; a SNP site where the haplotypes have different variants is called 
a heterozygous site. The conflated (mixed) data from the two haplotypes is called 
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a genotype. Thus, in genotype data, while the nucleotide variants at homozygous 
and heterozygous sites are known, the information regarding which heterozygous 
site SNP variants came from the same chromosome copy, is unknown. See Fig- 
ure 1 for an example of these concepts. Haplotypes play a very important role 
in several areas of genetics, including mapping complex disease genes, genome 
wide association studies, and also in the study of population histories. Unfortu- 
nately, current experimental techniques to infer the haplotype of an individual 
are both expensive and time consuming. However, it is possible to determine the 
genotype of an individual quickly and inexpensively. Computational techniques 
offer a way of inferring the haplotypes from the genotype data. 



Fig. 1. Two sequences from the same region on two nearly identical copies of 
a chromosome of an individual. Only the SNPs have been shown with the non- 
SNP positions labeled with a . In this example there are five SNPs. The first 
and the fourth SNP sites are homozygous, and the remaining three SNP sites 
are heterozygous. The individual has the two haplotypes ACATG and ATGTC; 
the genotype is A{G,T}{A,G}T{G, G}. 



Out of the two nearly identical copies of each chromosome in an individual, 
one copy is inherited from the paternal genome and the other copy from the 
maternal genome. This simple picture of inheritance is complicated by a process 
known as reeombination, which takes place during meiosis - a process involved 
in the formation of reproductive cells (or gametes) in the parents. During re- 
combination, portions of the paternal and maternal chromosomes are exchanged 
(Figure 2). Recombination can result in haplotypes in offspring that are different 
from those in the parents. The site on the chromosome where a recombination 
occurs is called a recombination site. On average, one or two recombinations 
occur per chromosome per generation [33] . 

In population studies, it has been shown that the likelihood that a site will 
act as a recombination site is not uniform across a chromosome [33], recombi- 
nation sites occur much more frequently than expected in certain chromosomal 
regions and much less frequently in other chromosomal regions. Regions of high 
recombination site frequency are called recombination hotspots. Several recent 
studies [9,29,44] have suggested that human genetic variation consists largely of 
regions of low recombination site frequency, delineated by regions of high recom- 
bination site frequency, resulting in blocks of SNPs organized in mini-haplotypes. 

An assumption that underlies much of population genetics, called the infinite 
sites model, requires that the mutation that results in a SNP occur only once 
in the history of a population, and therefore, that all individuals with a variant 
allele must be descendants of a single ancestor. While the infinite sites model 
is clearly a simplification of the true mechanism of genetic mutation, models of 
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Fig. 2. An illustration of the recombination process that occurs during meiosis. 
Recombination is characterized by a cross-over event in which a portion of a 
paternal chromosome is exchanged with a portion of a maternal chromosome. 
This can result in the offspring having different haplotypes from those in the 
parents. 



genetic variation built under this assumption compare favorably with empirical 
population genetics studies. Some of the models and algorithms in the text to 
follow will assume an infinite sites model. 

In this paper we present an overview of some of the computational approaches 
that have been taken for determining haplotypes. This survey is split into two 
parts, first approaches for haplotype phasing are presented and then approaches 
for haplotype assembly. 

2 The Haplotype Phasing Problem 

In this section, we consider the problem of haplotype phasing: Given a set of 
genotypes, find a good set of haplotypes that resolve the set. 

Generically the haplotype phasing problem can be posed as: 

Haplotype Phasing (generic) 

Input: A set G of genotypes. 

Output: A set H of haplotypes, such that for each g G G there exists hi,h 2 G H 
such that the conflation of hi with /12 is g. 

An alternate related problem is haplotype frequency estimation. In this prob- 
lem we care primarily about estimating the frequency of each potential haplotype 
in the population, and less so about the phasings of particular individuals. 

By typing genetically related individuals one can get a better estimate of 
haplotypes present since the haplotype pair of a child is constrained by its in- 
heritance from his parents [35,36]. This version of the problem is considered 
in various software packages [1]. In this paper, we assume that such pedigree 
data is not available to us, however recasting the problems presented here in the 
presence of pedigree data is a worthwhile avenue of research. 

Haplotype phasing has a variety of applications, each of which warrant dif- 
ferent methodologies. Goarsely, one can partition haplotype phasing problems 
into three classes, based on their tractability: 
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Small The number of sites is small enough that solutions requiring exponential 
space or time in it would be practical. It is sufficient for analyzing the SNPS 
in the vicinity of a single gene. 

Medium The number of sites is small enough that methods which are polyno- 
mial in the number of sites and individuals are practical. Number of indi- 
viduals and number of sites may be on the order of lOO's. This size roughly 
corresponds to the number of SNPs across a region spanning several genes. 
Large Chromosome size, where algorithms which are linear in the number of 
SNPs are the only ones practical. The number of sites could be in the tens 
of thousands while the number of individuals sampled is small. 

Additionally, many of the population genetics assumptions that hold for the 
small problems will not extend easily to the medium and large problems where 
the effects of recombination become significant. Different measures of success are 
appropriate depending on the problem size. Given a set of genotypes with a priori 
phasing information, a natural questions to ask is whether the algorithm retrieves 
the correct phasing. For small and medium problems, appropriate measures in- 
clude the number of haplotypes that are predicted correctly or the difference in 
population frequency of the haplotypes in the known and the predicted set. For 
very large problems it is likely that these measures will be blunt and all methods 
will not perform well. An alternate measure suggested in [37] is the number of 
crossovers to explain the correct haplotypes from the predicted haplotypes. 

When presenting the problems, we will assume that the genotype information 
we have is accurate. However, in practice, this is not the case, current genotyping 
technologies will fairly frequently not call genotypes (missing data) and less 
frequently miscall a genotype (wrong data) . A practical algorithm needs to deal 
with these problems, in particular the missing data problem. The discussion 
in this paper is in terms of SNP’s, most of the results and methods also will 
apply, perhaps with some modification, to studies of alternate genetic variations 
(markers) such as microsatellites. 

Notation We will follow notation by Gusfield [19] for haplotypes and genotypes. 
We will arbitrarily label the two alleles of any SNP 0 and 1. A genotype, repre- 
senting a pair of haplotypes, can take three values for each SNP, corresponding 
to the observation of {0}, {!}, {0, 1}. To simplify notation we will use 0 for {0}, 
1 for {1} and 2 for {0, 1}. We will say that a SNP is ambiguous in a genotype if 
it has value 2. A genotype is ambiguous if it contains more than one ambiguous 
SNP. 

We will generally use subscripts for objects associated with haplotypes and 
superscripts for objects associated with genotype. For example, the probability 
of observing the genotype g in a given population might be given as (f>^ and 
the haplotype probabilities as (j)h- Since superscripts are possibly confused with 
exponentiation, explicit parentheses will be placed around exponentiated quan- 
tities to disambiguate this. 

We will use -I- to denote conflation and write h + h = g if the conflation 
of h and h is g. To capture the notion that two haplotypes combine to make a 
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genotype, we will, when convenient to do so, use the Kronecker delta, = 1 
if h + h = g and 0 else. 

We will denote the number of genotypes with n and the number of SNP sites 

with TO. 



2.1 Clark’s Rule 

In a seminal paper [7], Clark proposed a common sense approach to phasing, 
that has become known as Clark’s rule. Clark’s rule is an inference method that 
resolves genotypes to their haplotype pairs. First, all homozygous and single am- 
biguous site genotypes are identified. The haplotypes that phase these genotypes 
are completely determined, forming an initial set of haplotypes supported by the 
data. Given a set of haplotypes H representing the resolved genotypes, Clark’s 
rule finds g G G and h G H such that g = h + h for some h. The haplotype h is 
added to H. The process continues until either all genotypes are resolved, or no 
suitable pair of unresolved genotype and resolving haplotype {g, h) exists. 

Note that it may not even be possible to get this algorithm started if there 
are no homozygous or single ambiguous site genotypes. Further, there is no 
guarantee that a particular sequence of applications of Clark’s rule will resolve 
all genotypes. Genotypes that remains unresolved after a maximal sequence of 
applications of Clark’s rule are called orphans. 

It should be clear from the description of Clark’s rule that it describes a class 
of algorithms, each of which uses a different protocol for selecting a genotype- 
haplotype pair from which to infer a (typically) new haplotype. Clark’s paper 
applies a greedy approach, in which the known haplotypes are tested against 
the unresolved genotypes in turn. The first genotype that Clark’s rule can be 
applied to is resolved, potentially adding a new haplotype to the set of known 
haplotypes for the next iteration. 

It is natural to ask for a Clark’s rule application sequence that results in the 
fewest number of orphans. Clark’s experiments [7] on real and simulated data 
suggest that the sequence of applications of Clark’s rule that resolves the most 
genotypes generates fewest incorrect haplotype assignments. 

Problem 1 (Minimizing Orphans). Find a sequence of Clark’s rule applications 
that results in the fewest orphans. 

Biological intuition about the nature of haplotypes present in human popu- 
lations prompt us to think about versions of problem 1 that produce solutions 
that respect this intuition. 

Problem 2 (Maximizing Unique Resolutions). Find a sequence of Clark’s rule 
applications that maximizes the number of resolutions subject to the constraint 
that the final set of haplotypes must provide a single unique resolution to each 
genotype. 
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Problem 3 (Minimizing Inference Distance). Find a sequence of Clark’s rule ap- 
plications that minimizes the number of Clark’s rule applications necessary to 
generate the genotypes’ haplotypes. 

Gusfield [19,20] studied a slightly restricted version of this problem, in which 
each genotype can participate in at most one Clark’s rule application. Gusfield 
showed that finding an optimal Clark’s rule application sequence is NP-hard, 
but that in practice, on medium-sized instances, this version of the problem 
can be solved by a combination of careful enumeration and linear programming. 
Gusfield also evaluated the effectiveness of an algorithm incorporating a greedy 
application of Clark’s rule with mixed results. 

2.2 Maximum Likelihood 

Hardy- Weinberg equilibrium (HWE) is the condition that the probability of 
observing a genotype is equal to the product of the probabilities of observing 
its constituent haplotypes (see [23]). Under this hypothesis, the probability of 
genotype g in the population is related to the haplotype probabilities by the 
compact expression 

h-\-h—g 

where 4>h is the probability of haplotype h in the population. 

The maximum likelihood method of [15,24,39,52,16,1] estimates the haplotype 
probabilities 4>h = {<Ph, 4 ’Tit ■ ■ ^ 4>h') from observed genotype frequencies <j)^ in n 
individuals. The approach assumes HWE and a uniform prior on the 0^’s. The 
likelihood function of the observed is then 

= (1) 
geG 

where 0® = 4>h(l>h- The estimated (fn is a maximum of L subject to the 

constraints that f and 0^ > 0, V/i G H. 

There is a great deal of literature on the maximization of this polynomial, 
for example the method of Expectation Maximization is a linearly convergent 
method guaranteed to locate a local maximum of L from almost every (feasible) 
starting point. 

However, a naive implementation of the EM method requires exponential 
space, since there are 2™ unknown haplotype probabilities which must be stored 
for m variant sites. One notes note that, for n sampled individuals, I7(n) hap- 
lotypes are expected to have significant probability. An efficient way to discover 
those haplotypes which contribute significantly to the maximizer of L would 
make this approach much more efficient. 

Problem 4 (Haplotype Support Problem). Given observed genotype frequencies 
03, and e > 0, find H' C H, such that one can guarantee that there exists a 4>h 
that is a global maximizer of L and that h ^ H' implies 0?i < e. 
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The Phasing Polynomial We will now give a combinatorial interpretation of 
L. We assume that comes from counts of individual observed genotypes, and 
thus n(j)^ is integral for each genotype g. We may then formulate L in terms of 
a product over n observed individual genotypes gi {1 < i < n), i.e. 

n n 

i=l i=l h+h=gi 

Interchanging product and summation this becomes 

hiih2,-"h2n 



Let an explanation of the genotypes g = {gi, . . . ,gn) be a sequence of 2n hap- 
lotypes h = {hi,h 2 , ■ ■ ■ , ^ 2 n) such that /i 2 i-i + ^ 2 i = gi- Then the polynomial 
above can be more compactly expressed as 

L= ^ 4'hi4’h2 ■ ■ ■ 4’h2r^ 

h explains g 

with the sum ranging over all explanations of g. The likelihood function is a 
polynomial with a term of coefficient 1 for each possible explanation of the ob- 
served genotypes. Thus, a solution to the genotype phasing problem corresponds 
to a particular term in this polynomial. 

The maximum likelihood approach seeks frequencies <j)H which maximize L. 
This problem is known to be NP-hard [26]. Also note that the problem does 
not directly address the problem of computing the individual phasings for each 
genotype. However, approximations can be made which recover the combinato- 
rial nature of the phasing problem. 



A Discrete Approximation Let us collect the terms of L, and use a multi- 
index P (a vector of non-negative integers indexed by iL) to keep track of the 
exponents, then 



L = Y^K{P,g) 

P hGH 

where K{P,g) denotes the number of explanations of the observed genotype 
counts g which have P haplotype counts. 

Since the (f)h are constrained to lie between 0 and 1, most of the terms in L 
are expected to be small. We may approximate L with its largest term: 

L ~ Lmax = nmx | K (P, g) n 

I h&H 
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The maximization of Lmax with respect to the iph is trivial, since any monomial 
in probabilities, (j)h, is maximized by (j)h = Ph/{2n). Thus 

ta&xL Lmax = max i (2n)“^”A:(P,g) TT(P/i)^'* 

p [ Y 

Thus, we see that the maximization of the maximal term of the likelihood poly- 
nomial reduces to a discrete problem. The solution of this problem does not 
give a phasing, but a collection of possible phasings with identical counts. The 
solution may also be a good initial point for an iterative maximum likelihood 
method, such as expectation maximization. 

The objective function in this optimization problem is 

F{P,Ng^) = K{P,G)\[{Phf\ 

h 

where Pi = 2N, which counts the number of ways to select 2N haplotypes 
from a bag with counts P with replacement to form an explanation of the geno- 
types G. 

We are not aware of any results about the complexity of evaluating F or its 
maximum. In fact, there is a feasibility problem to which we have found no easy 
answer as well. 

Problem 5 (Haplotype Count Feasibility) . Given genotypes g = (gi, . . . ,g„) and 
a vector of counts P over H, decide whether there exists an explanation of g 
with counts P. 



Problem 6 (Counting Arrangements K{P,g)). Given genotypes g = ((/i, . . . , (/„) 
and a vector of counts P, count how many distinct explanations, 
h = (hi, /i 2 , • ■ • , h 2 n-i, h 2 n), exist for g with counts P. 

Problem 1 (Maximizing Arrangements). Given g = {gi, . . . ,gn), find counts P, 
such that K{P,g) is maximized. 



Links to Clark’s Rule One method for breaking ties in the application Glark’s 
rule is to allow the haplotype frequencies to serve as probabilities, and randomly 
select g’s and h’s to which to apply it. In such a scheme, one would still resolve the 
homozygotes and the single-site heterozygotes, since they are unambiguous, but, 
when faced with a choice between multiple phasings, one randomly selects the 
phasing h,h with probability (f)h()h/ ■ Since this procedure is still strongly 
dependent on the order of consideration for the ambiguous genotypes, one draws 
them, and re-draws them, uniformly at random, from the sampled individuals. 

This process can be viewed as a means to generate a sample from a stationary 
point of the maximum likelihood function. To see this, we view the individual 
samples as all having some phase, which we rephase through random applications 
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of Clark’s rule with random tie-breaking as above. In the continuum limit, new 
instances of haplotype h are introduced at a rate 

^<f)h = (2) 

g,h:h-\-h—g 

where = ^h+h=g while instances of haplotype h are being removed 

(by individuals with haplotype h being re-drawn) at a rate 

^(t>h = —iph- 

A steady state occurs when the two processes balance, i.e. 

g,h:h-\-h—g 

which is a sufficient condition for (f)H to be a local maximum of the likelihood 
function of equation 1. Thus, the haplotypes sampled in this random application 
of Clark’s rules are distributed according to some stationary distribution of L. 

2.3 Clark’s Rule and Population Models 

The observation that the maximum likelihood method could be modeled by a cer- 
tain probabilistic application of Clark’s rule was known to researchers, Stephens, 
Smith, and Donnelly [50], who proposed a modification of the ML sampling pro- 
cedure of the previous section. Their modification introduces an approximate 
population genetics model [48] as a prior for observing the set of haplotypes. 

Instead of phasing randomly selected individuals with probabilities weighted 
by 4>h4>Tn they proposed a more complicated probability rule, where the weight 
of phasing h, h for g is given by 

^l+hT^h{h\h) ■ TTj^{h\h\h) (3) 

where h\h is the sequence of haplotypes h with one occurrence of h removed. 
The function TTh{h) approximates the probability that haplotype h might be 
generated either by direct descent or mutation from a population with haplotype 
counts of h. 

It should be noted that equation 2 applies only when N is much larger than 
the number of haplotype variants in the population. Thus it is not strictly ap- 
plicable for small populations where a substantial portion of variants occur only 
once. It is not an issue for this Markov Chain Monte Carlo (MCMC) approach. 

The algorithm they propose is to iteratively modify the explanation of the 
given genotypes, selecting the explaining haplotypes h, h for a random individual 
with genotype g, and replacing that pair with a pair generated randomly with 
weights from equation 3, updating the current frequencies, 4>h, of the variants 
in the sample. Statistics of the sampled set of phasings are then used to select 
the phasings of the individuals. 
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It remains to define the approximation of TTh{h), for which they propose 

7T,(/t) = ■ h){I - (4) 

h' 

where h ■ h counts the number of occurrences of /i in /i, 0 is an estimate for the 
per site mutation rate, I is the identity matrix, and M is the single site mutation 
matrix, M^h' = 1 iff and h' have exactly one mismatch and 0 otherwise. This 
is an approximation to the probability that h can come from some h' after a 
geometrically distributed number of single site mutations. This approximation 
arose from considering a random population model in [28]. It should be noted 
that while the matrix M appears to be of exponential size, an arbitrary element 
~ 2 N+e ^)~^ computed in 0{m) time. 

An implementation of this algorithm by Stephens, Smith, and Donnelly is 
PHASE [50,49]. An alternative implementation, which more closely follows the 
maximum likelihood method was produced by Niu et al [42]. PHASE works well 
on medium problems with a small population. 



2.4 Parsimony Formulations 

Extending Clark’s basic intuition that unresolved haplotypes are to look like 
known ones, a variety of parsimony objectives can be considered. 

In the context of haplotype phasing, the most parsimonious phasing refers 
to the solution that uses the fewest haplotypes. Hubbell [27] showed that this 
version of the problem is NP-hard, in general, by a reduction from minimum 
clique cover. Gusfield [22] solved the problem via an (exponentially large) integer 
programming formulation that is solvable in many cases, even for medium-sized 
problems. 



Phasing via Integer Programming Integer programming, as employed by 
Gusfield [22] to find the most parsimonious phasing, is a very effective solution 
technique for optimization problems with combinatorial structure, even when 
they are NP-hard. As Gusfield demonstrated, with a suitable integer program- 
ming formulation and a powerful solver package, many realistic instances can be 
solved. The tractability of such integer programs for a given instance depends 
critically on the nature of the integer programming formulation for the problem. 

We present two integer programming formulations for the haplotype phasing 
problem. 

Formulation 1 (Gusfield [22]) Given genotypes G, we define the set of pos- 
sible haplotypes H = {h\3g, h s.t. g = h h}. Further, for each genotype g € G 
we define the set of valid phasings, Pg = {{h, h}\g = h-\-h;h,h G H}. 

In this formulation, we use an indicator variable Xh for each haplotype h G H 
and an indicator variable ygp for each genotype g G G and p G Pg. 
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min ^ Xh 
h 




S.t. V9P = 1. 


for all g € G, 


pePg 




Ugp Si 


and 


Ugp S Xh, 


for all g € G and p = {h, h} G Pg 


Xh G {0, 1} 


for all h G H 


Vgp € 


for all g G G and p G Pg. 


The first constraint of formulation 1 ensures that all genotypes are phased by 
some phasing, while the second and third constraints ensure that if a genotype 
is phased by some haplotype pair, then those haplotypes are paid for in the 



objective function. 

In practice, this formulation can be prohibitively large. To solve this inte- 
ger program Gusfield shows that for particular problem instances many of the 
variables and constraints can be eliminated without changing the nature of the 
problem. This brings the size of the integer program down to manageable size 
for many instances. 

A more explicit formulation of the phasing problem models the resolution of 
the ambiguous genotype sites explicitly. 

Formulation 2 Given genotypes G = {gi, . . . ,(/„}, we denote the value at site 
i of genotype g G G by g{i). Let Ag he the ambiguous sites of g € G, so that for 
all i € Ag, g[i) = 2 . Further, let 'H{g) and 'H{g) represent the first and second 
phasing haplotype ofg, respectively, so thatg = H{g)+'H{g). Similarly, we denote 
the genotype g that haplotype h phases by Q{h), so that Qili-ig)) = Q{'H{g)). The 
phasing haplotypes of G are indexed so that = Tl{gi) and /i 2 i = 'H{gi). 

In this formulation, we use an indicator variable Xga for each ambiguous site 
a & Ag and genotype g € G. Xga = 0 indicates that the haplotypes TL{g) and 
'H{g) have a 0 and 1 at position a of g, respectively; while Xga = 1 indicates that 
haplotypes TL{g) and 'H{g) have a 1 and 0 at position a of g, respectively. The 
indicator variable cjj> is 1 if haplotypes hj and hf are different, and 0 if they 
are the same. Finally, we use the indicator variable Ij to indicate whether or not 
haplotype hj is the last use of the haplotype represented by hj. 





2n 

min 1, 

j—l 












L(hj , i) 


— L{hji , i) < Cjji 


for all 


j,f = 


l,...,2n,j 


<j',i= 1 ,.. 
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L{hji , i) 


VI 
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for all 


j,f = 


l,...,2n,j 


<j',i= 1 ,.. 
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L{h, z) = 0 




= o,g 


= Q{h) 








L{h, i) = 1 


if g{i) 


= 1,5 


= G{h) 








L{h,i) = Xgi 
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= 2,ff 


= G{h),h = 
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L{h, i) 


= (i 


Xgi) 


if g{i) = 2,g = 


g{h),h = H{g) 






■)<h 


for all j = 1,.. 


■ , 2n, 


j<j' 












X, 


ga=Q 


for all g G G, a 


= min Ag 


^ga ^ 


{0,1} 


for all g G G, a 


yf min Ag 






{0,1} 


for all j,f = 1, 


< j', 




Ij e 


{0,1} 


for all j = 1, . . 


• , 2n, 



The first and second constraints of formulation 2 enforce the definition of the 
Cjji in terms of the unambiguous sites of each genotype and the settings of the 
Xga variables, while the last constraint enforces the definition of the Ij ensuring 
that Ij is 1 if all later haplotypes are different. 

As with formulation 1, formulation 2 contains many unnecessary variables 
and constraints, depending on the particular problem instance that is being 
solved. This formulation isn’t particularly suitable for solving with traditional 
integer programming solvers, since even when particular c variables are known 
to be 1, the relevant x variables can take many possible values. 

Problem 8 (Implicit Phasing Integer Program). Reformulate formulation 2 to 
eliminate the x variables by instead deriving a sufficient set of constraints on the 
c variables from the set of genotypes G. 

An intriguing open problem is to determine whether there are practical in- 
stances when this problem can be solved efficiently (for example if the perfect 
phylogeny condition holds, see section 2.5). 

Problem 9 (Restricted Parsimony). Find a restriction on the input to the haplo- 
type phasing problem that most real world instances satisfy, for which the most 
parsimonious haplotype assignment can be found in polynomial time. 

Diversity is another commonly used parsimony objective in population genet- 
ics. Haplotype diversity is defined as the probability that two haplotypes drawn 
uniformly at random from the population are not the same. 

Problem 10 (Haplotype Diversity Minimization). Devise an algorithm for the 
haplotype phasing under the objective of minimizing haplotype diversity. 

We note that the integer programming formulation 2 above naturally solves 
the diversity objective under a suitable cost function. Graph theoretically, this 
problem can be posed as constructing a graph with a node for every haplotype 
in the observed population (two nodes for each observed genotype), an edge 
between every pair of haplotypes that are not equal and then minimizing the 
number of edges in the graph. 

We observe that Clark’s rule is not effective for parsimony. 

Lemma 1. Clark’s rule does not yield an effective approximation algorithm for 
parsimony. 
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Let Ud be the number of distinct genotypes in the G. The trivial algorithm of 
arbitrarily phasing each distinct genotype will return a phasing with at most 
2nd haplotypes. f2{-^/nd) is a lower bound on the number of haplotypes as each 
genotype is made of at most two distinct haplotypes. A worst case approximation 
guarantee is thus 0{^Jrid), we will give such an example. 



' (1111) -t (1111) ' 




' (nil) ' 




' (nil) + (nil) ' 


(1111) -t (0011) 




(2211) 




(0111) + (1011) 


(1111) -t (1001) 




(1221) 




(1011) + (1101) 


(nil) -t ( 1100 ) 


(1122) 


(1101) + (1110) 


(nil) -t ( 1010 ) 




(1212) 




(1011) + (1110) 


(nil) -t ( 0101 ) 




(2121) 




(0111) + (1101) 


. (nil) -t ( 0110 ) . 




. (2112) . 




. (0111) + (1110) . 



Fig. 3. Set of 7 genotypes with 7 haplotype Clark’s rule resolution Vc, and 4 
haplotype parsimony resolution Vp. 



Let m be the number of SNPs and let G be comprised of genotype that has 
all ones and all (™) possible genotypes that have exactly two 2s and all other 
SNPs as Is. Clark’s inference rule will initially infer the haplotype of all ones 
and then infer the (™) haplotypes that have all but 2 SNPs as Is. The resolution 
with the minimum number of haplotypes however has the m haplotypes with all 
but 1 SNP as 1. An example when m = 4 is given in Figure 3. □ 

The Hamming distance between a pair of haplotypes is, under the infinite 
sites model, the number of mutations that occurred in the evolutionary history 
between the pair of haplotypes. If we consider an evolutionary history to be a 
tree whose nodes are the unknown haplotype sequences of the observed genotype 
sequences, then a likelihood function which approximates it [43] in terms of 
Hamming distance is given by: 

L{h)^Y. n (5) 

T eeEdges(r) 

where T ranges over all trees on the 2n nodes with unknown haplotypes hi G H , 
I < i < 2n, e ranges over all 2n — 1 edges in T, D{e) is the Hamming distance 
between the hi and hj which are joined by e, and / is a monotonic function. One 
reasonable choice might by f{x) = where [3 plays the role of the mutation 
rate, or one might take / from equation 4. 

This sum over all trees of products of edge weights can be evaluated in poly- 
nomial time (using Kirchoff’s matrix-tree theorem [5,32]). Methods for sampling 
from this and related distributions can be found in [4,8]. 
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If we take f{x) = e then we can interpret equation 5 as a partition 
function from statistical mechanics, 

T 

where E{T^ h) is the sum of the Hamming distances on all the edges in T. 

Problem 11 (Partition Function Maximization) . Devise an algorithm which max- 
imizes 

/ 3 ) = ^ ( 6 ) 

T 

over all h explaining g. 

This problem has two asymptotic regimes. 

The first is the low temperature regime [3 oo, where, one can approximate 
the summation with maximization, 

Z{h] (3 ~ oo) ~ maxe”^^*-^’^^ 

= exp{— /3nnn E{T, h)} 

and approximate the partition function with the minimum weight tree. 

Problem 12 (Tree Minimization) . Devise an algorithm which finds 

min E(T,h) (7) 

T.h 

over all h explaining g and all trees T. 

The second is the high temperature regime /3 ~ 0 

Z(/^;/3)~^(l-/3^;(T,/^)) = (2n)2"-2(l-D ^ D{hi,h2)) 

T h\ C h 

where D(hi,h 2 ) is the Hamming distance between h\ and / 12 . In this extreme, 
the approximate problem is the minimization of the sum of all pairwise Hamming 
distances. 

Problem 13 (Sum of Pairs Hamming Distance Minimization). Devise an algo- 
rithm which finds 

min ^ D{hi,h 2 ) (8) 

hi,h2&h 

over all h explaining g and all trees T. 

Figure 4 gives an example where the sum of pairs Hamming distance mini- 
mization does not yield the same phasing as parsimony. 

At the time of this writing, we are not familiar with any progress on these 
problems. 
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' ( 11111111 ) + ( 11111100 ) ' 
( 11111111 ) + ( 11111001 ) 
( 11111111 ) + ( 11110011 ) 
( 11111111 ) + ( 11001111 ) 
( 11111111 ) + ( 10011111 ) 

^ ( 11111111 ) + ( 00111111 ) ^ 





' ( 11111122 ) ' 
( 11111221 ) 






( 11112211 ) 




. < 

> ^ 


( 11221111 ) 

( 12211111 ) 

^ ( 22111111 ) ^ 





' ( 11111101 ) + ( 11111110 ) ' 
( 11111011 ) + ( 11111101 ) 
( 11110111 ) + ( 11111011 ) 
( 11011111 ) + ( 11101111 ) ^ 
( 10111111 ) + ( 11011111 ) 

^ ( 01111111 ) + ( 10111111 ) ^ 



Fig. 4. Set of 6 genotypes with 7 haplotype parsimony phasing Vp, and 8 hap- 
lotype minimum sum of paired Hamming distances phasing Vh- 



2.5 Perfect Phylogeny 

The concept of a perfect phylogeny [11,46,3] has also been used to formulate 
constraints on haplotype phasings. A (binary) perfect phylogeny is defined as 
follows: Let S' be a set of n sequences (haplotypes) each drawn from A™, where 
the alphabet S = {0, 1}. We say that S admits a perfect phylogeny if there 
exists a tree T with n leaves that has the following properties: (1) Each leaf of 
T is uniquely labeled with a sequence from S, (2) Every internal node u in T is 
labeled with a sequence from A™, and (3) For each sequence position i (where 
\ < i < m) and for each a G S, the set of nodes whose sequence labels each 
have the symbol a at position i, forms a subtree of T. The tree T is said to be 
a perfect phylogeny for S. 

Gusfield [21] introduced a haplotype phasing problem that was motivated by 
studies on the haplotype structure of the human genome that reveal the genome 
to be blocky in nature ([9,25,47,17]), i.e., these studies show that human genomic 
DNA can be partitioned into long blocks where genetic recombination has been 
rare, leading to strikingly fewer distinct haplotypes in the population than previ- 
ously expected. This no-recomhination in long blocks observation together with 
the standard population genetic assumption of infinite sites, motivates a model 
of haplotype evolution where the haplotypes in a population are assumed to 
evolve along a coalescent, which as a rooted tree is a perfect phylogeny. Infor- 
mally, this means that each SNP changed from a 0 to a 1 at most once in this 
rooted tree (here we are assuming that 0 is the ancestral state for a SNP) . This 
motivates the following algorithmic problem called Perfect Phylogeny Haplotyp- 
ing problem (PPH) - given n genotypes of length m each, does there exist a set 
S of at most 2n haplotypes such that each genotype is explained by a pair of 
haplotypes from S, and such that S admits a perfect phylogeny? 

In [21], it was shown that the PPH problem can be solved in polynomial 
time by reducing it to a graph realization problem. The algorithm runs in 
0{nma{nm)), where a is the inverse Ackerman function, and hence this time 
bound is almost linear in the input size nm. The algorithm also builds a linear- 
space data structure that represents all the solutions, so that each solution can be 
generated in linear time. Although the reduction described in [21] is simple and 
the total running time is nearly optimal, the algorithm taken as a whole is very 
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difficult to implement, primarily due to the complexity of the graph realization 
component. 

Following the work in [21], additional algorithms [2,14] have been proposed 
to solve the PPH problem that are simpler, easy to program and yet still effi- 
cient. These algorithms also produce linear-space data structures to represent all 
solutions for the given instance. Though they use quite different approaches, the 
algorithms in [2] and [14] take 0{nrn?) time. In [2], a non-trivial upper bound on 
the number of PPH solutions is also proved, showing that the number is vastly 
smaller than the number of haplotype solutions when the perfect phylogeny re- 
quirement is not imposed; furthermore, a biologically appealing representation 
is proposed that aids in visualizing the set of all solutions. In [14], an approach 
is also provided to deal with parent-child genotype data. 

There are several interesting questions posed as a result of the works of 
[21,2,14]. We list three of them here. 

Problem 14 (Optimal PPH). Can the PPH problem be solved in 0{nm)l If so, 
is a practical algorithm possible? 

Problem 15 (PPH with missing data). Devise solutions to deal with missing data 
and errors in the input. 

The above problem is important as real data, very often, contains both miss- 
ing data and errors. There are several directions that one could pursue here. For 
example, as in [14], one could study the complexity of the problem of removing 
a minimum number of genotypes so that the phasing of the remaining genotypes 
admits a perfect phylogeny. Alternatively, one could ask the question, can each 
missing value be set to one of 0, 1, or 2 so that the resulting instance has a 
perfect phylogeny? Halperin and Karp [12] study this problem in a framework 
where the data are assumed to be generated by probabilistic models. Their re- 
sults include a quadratic-time algorithm for inferring a perfect phylogeny from 
genotype data (with missing values) with high probability, under certain distri- 
butional assumptions. 

The above problem is important as real data, very often, contains both miss- 
ing data and errors. There are several directions that one could pursue here. For 
example, one could ask the question, can each missing value be set to one of 0, 1, 
or 2 so that the resulting instance has a perfect phylogeny? Alternatively, as 
in [14], one could study the complexity of the problem of removing a minimum 
number of genotypes so that the phasing of the remaining genotypes admits a 
perfect phylogeny. 

Problem 16 (PPH with recombination). What is the complexity of the problem 
when small deviations from the no-recombination model are allowed? 

For instance, allowing for a small number of recombination events, can we 
still phase the genotypes efficiently in this framework? Allowing recombination 
events means that the solution is no longer a tree but a network (i.e. a graph 
with cycles) [51]. 
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Recently, several approaches have been presented that explicitly model re- 
combination [18,13,31]. 

Further work on haplotype phasing under the perfect phylogeny assumption 
has been Damaschke [10], by making the assumption that each haplotype has 
frequency at least a probabilistic algorithm is given that determines a phasing 
in time 0(mnpolylog(m)). 



3 Haplotype Assembly 

The need to infer haplotypes directly from genotypes is based on the assumption 
that biotechnologies for haplotype determination are unlikely to be available in 
the short term. This may not be the case. Various approaches to single molecule 
sequencing have been described recently [40,41,6] and some of these may mature 
to the point that phasing based solely on genotype analysis becomes unnecessary. 

An increase in the current read length (~ 500) in a sequencing reaction to a 
few thousand basepairs, make it possible to phase large regions of a chromosome. 
Assuming that a SNP occurs every 1000 basepairs, many fragments will contain 
multiple SNPs. Consider a sequence assembly containing fragments /i,/ 2,/3 
from a single individual. If fi and /2 differ in a SNP, they must come from 
different chromosomes. Likewise if / 2 , and /s also differ in (some other) SNP, they 
come from different chromosomes. However, for a diploid organism, this must 
imply that fi and /s come from the same chromosome, and we have therefore 
phased the individual in this region (see Figure 5). 

If reads have a length L and the distance between SNPs is exponentially 
distributed with a density p, then, with high coverage, the probability of being 
able to resolve the phase of two adjacent SNPs is (1 — e“'’^), which is close 
to 0.99 even for pL = 5. Thus, an order of magnitude increase in the current 
average read length could result in a haplotype resolution by haplotype assembly 
of groups on the order of 100 SNPs. Even current technology can produce reads 
of over 1000 basepairs, by creating a gapped read, where only the ends of the 
fragment are actually read, leaving a large gap in the middle. 

Formally, define a SNP matrix M with rows representing fragments, and 
columns representing SNPs. Thus M\f,s] G {0, 1,—} is the value of SNP s in 
fragments /. Gapped reads are modeled as single fragments with gaps (— ) in 
SNPs that in the gap. Two fragments / and g conflict if there exists SNP s 
such that M[f,s] = 0, and M[g,s] = 1 or vice-versa. Based on this, a SNP 
matrix M can be used to define a fragment conflict graph Gjf. Each fragment 
is a node in Gjf. Two nodes are connected by an edge if they have different 
values at a SNP. It is easy to see that Gjp is bipartite in the absence of errors. 
In the presence of errors, we can formulate combinatorial problems that involve 
deleting a minimum number of nodes (poor quality fragments), or edges (bad 
SNP calls), so that the resulting graph is bipartite (can be phased trivially). 
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Fig. 5. An illustration of the construction of long-range haplotypes from assem- 
bly data. A) the fragment assembly of a diploid organism. B) the identification 
of SNPs. C) the partitioning of the fragments into two consistent groups, intro- 
ducing a long-range phasing. 



In [45,34], the following is shown: 

1. The minimum fragment removal and minimum SNP removal problems are 
tractable if the underlying matrix M has the consecutive ones property, i.e., 
there is no gap within a fragment. 

2. They are NP-hard in the case of gaps, even when limited to at most one 
gap per fragment. The problems are shown to be tractable under a fixed 
parameter. That parameter being the total length of gaps in a fragment. 

The algorithms are thus not tractable for dealing with the case of fragments with 
gaps, and it is an interesting open problem to design heuristics/ approximations 
that give good results in practice. Some branch and bound heuristics were re- 
ported to work very well on real and simulated assembly data in [38] . Li et al. [30] 
give a statistical reformulation of this problem. 

A fairly immediate extension to this problem, is the problem of simultane- 
ous assembly multiple haplotypes. This will occur when studying multiploidal 
organisms or when simultaneously assembling related organisms, or a pooled set 
of individuals. For practical consideration it may be easier to sequence multiple 
related organisms simultaneously, for example to assemble different strains of a 
bacteria simultaneously. 

Problem 17 (Multiploidal Haplotype Assembly). Devise an algorithm for assem- 
bling multiple haplotypes simultaneously. 
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4 Discussion 

While the subject of haplotype phasing, or frequency inference may be of interest 
on purely statistical and mathematical grounds, the desired end result generally 
is an understanding of the implications of genetic variation in individual pathol- 
ogy, development, etc. As such, these variances are one part of a larger set of 
interactions which include individual environment, history, and chance. 

Although the media often carries stories of “genes” (actually alleles) being 
found for some popular diseases, biologists agree that such stories are rather the 
exception than the rule when it comes to disease causation. It is suspected to be 
more the case that an individual’s genetic variations interact with each other as 
well as other factors in excruciatingly complex and sensitive ways. 

The future open problems of SNP analysis are those regarding the interac- 
tions of genetic variation with external factors. Substantial progress in multifac- 
torial testing, or a tailoring of current multifactorial testing to this setting, is 
required if we are to see an impact of haplotype analysis on human health care. 
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Abstract. Haplotypes, defined as a set of DNA polymorphism markers physi- 
cally located on a single chromosome, have gained exploding magnitude of in- 
terest owing to its potential value in disease gene identification and in pharma- 
cogenomics. Because molecular haplotyping methods remain too costly to be 
used on a regular basis, statistical techniques for haplotype inference have 
emerged as the most time- and cost-efficient approach. This chapter explains 
the statistical theory and algorithms behind several in silica haplotype phasing 
strategies; reviews the partition-ligation idea for dealing with a large number of 
linked SNP marker loci; and proposes new methods for handling genotype un- 
certainty in the genotyping machine output as well as the pooled marker data. 
We also discuss the application of haplotype information in disease mutation 
detection in case-control designs and the impact of haplotype information on 
locus estimation accuracy. As an illustration, we applied the haplotyping tool 
PL-EM jointly with the LD mapping algorithm BLADE to a case-control study 
of the SNP markers surrounding the Alzheimer disease susceptible gene APOE. 



1 Introduction 

Single base changes (i.e. substitutions, insertions or deletions), or single nucleotide 
polymorphisms (SNPs), are the most abundant genetic markers distributed across the 
entire human genome, and they are the “nuts and bolts” in deciphering the genetic 
bases of common human diseases (Shastry 2002). Since particular alleles at neighbor- 
ing loci tend to be associated with one another due to “co-inheritance”, known as 
linkage disequilibrium (LD) (Ardlie et al. 2002), this information can be used to pin- 
point genes associated with certain diseases. LD typically decays exponentially with 
increasing physical distance, but the decay rate is extremely unevenly distributed 
across the genome landscape. Recombination hot spots, genetic drift and natural se- 
lection influence the extent of LD in the human genome, although their relative 
contributions remain unclear. The most pressing challenge now is to develop tools for 
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most efficiently using SNP data for disease gene mapping, evolutionary history re- 
construction, and pharmacogenomics. 

The term haplotype was coined to describe the “joint configurations” of alleles for 
a set of linked markers (e.g. SNPs) linearly ordered on a single chromosome. Haplo- 
types are arguably more biologically meaningful units than single markers, since 
effects of alleles at single, isolated markers are often difficult to assess due to their 
LD with certain allele combinations at neighboring loci. Resolving haplotype phase is 
important for two reasons. First, for a set ofN tightly linked SNPs, only a small num- 
ber (of order 0(N)) of common haplotypes are generally found in the population as 
opposed to all possible allele combinations (2^); and this can potentially increase the 
sensitivity of gene-disease association studies. Second, haplotype configuration cap- 
tures information of population history and recombination frequency in a particular 
genomic region. Such information is of great value for association studies. 

Both the human genome (Patil et al. 2001; Gabriel et al. 2002) and the mouse ge- 
nome (Wiltshire et al. 2003) have been found to consist of discrete haplotype blocks: 
sizeable regions over which there is little evidence for historical recombination. In 
humans, the boundaries of such blocks and the haplotypes within each block ob- 
served were found to be remarkably similar across populations from different conti- 
nents (Gabriel et al. 2002), supporting the idea that a single ancestral population of 
small size gave rise to the current human population. Moreover, the similarity of 
haplotype blocks across various ethnicities indicates that a haplotype map would be 
widely useful across the human population. Flaplotype blocks are important because 
the within-block haplotype diversity is often very limited due to high LD. Therefore, 
the common haplotypes within each block can be represented by a few tagging SNPs 
that are extremely cost-efficient in conducting association studies of common human 
diseases (Johnson et al. 2001). 

In this chapter, we review the statistical basis of several recent haplotype inference 
algorithms and discuss some complications and applications of these algorithms. We 
observe that (a) by incorporating genotype uncertainties, we can improve the accu- 
racy of haplotype phasing significantly, and (b) the population-based in silica haplo- 
type construction is helpful for fine-mapping of genes related to complex diseases, 
but not so much for simple Mendelian diseases. 



2 Basic Statistical Models for Population-Based Haplotype 
Inference 

The most straightforward approach to resolving ambiguities in haplotype determina- 
tion is to perfomi costly and labor-intensive “wet-lab” experiments, such as allele- 
specific long-range PCR (Michalatos-Beloin et al. 1996), or the diploid-to-haploid 
conversions (Douglas et al. 2001). Another standard approach is to resort to pedigree 
information. However, in most case-control genetic epidemiology studies, both ap- 
proaches are too costly to be practical with the latter one often being infeasible. In the 
past decade, several in silica haplotype-phasing approaches have been developed, 
including the parsimony algorithm (Clark 1990), the Expectation-Maximization (EM) 
algorithms (Excoffier and Slatkin 1995; Hawley et al. 1995; Long et al. 1995; Chiano 
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and Clayton 1998, Qin et al. 2002), and the Gibbs Sampler based algorithms 
(Stephens, et al. 2001, Niu et al. 2002). 

Clark’s method assumes no explicit statistical model and is generally inferior to the 
likelihood-based methods (Niu et al. 2002). In all population-based statistical haplo- 
type inference methods, one assumes that an individual’s two haplotypes are two 
random draws from a “pool” of haplotypes with potentially differing haplotype 
frequencies. Since we observe only the genotype data, the phase infomiation is lost 
and treated as missing data. The likelihood of this model can thus be written as 

/>(r|e)=nF(v,|e)=li Z 

/=0 i=0 {g,h)\g@h=y^ 

where Y and Z denote the observed genotype data and unobserved phase infomia- 
tion, respectively; g and h denote two particular haplotypes and g®h = y. indi- 
cates that they are compatible with the observed genotype data and n is the 
number of individuals in the sample. The EM algorithm boils down to the iteration: 

^d+i) _ {n^ \ Y')1 2n, where is the current estimate of haplotype frequen- 

cies, rig is the number of haplotype g that may exist in the sample. Besides its likeli- 
hood of being trapped in a local optimum of the likelihood surface, the canonical EM 
algorithm is also limited in the number of linked SNP markers it can handle due to 
computational constraints (usually ~20). The PL-EM method of Qin et al. (2002) 
overcomes this shortcoming by combining EM algorithm with the partition-ligation 
(PL) idea (details later). 

The Bayes method (Stephens et al. 2001, Niu et al. 2002) assumes a prior distribu- 
tion on © and obtains its inferential results by Markov chain Monte Carlo (Liu 2001). 
The most straightforward prior for © is the Dirichlet distribution Dir (/?), with which 
one obtains the joint posterior distribution for (Z, ©): 

P(Z,0|7)oc]^^”'^^, 

./=i 

where P = is the pseudo-count vector, is the number of haplotype 

j in the phased sample, and K is the total number of distinct haplotypes. From this 
formula, one can further integrate out © to obtain that 

P{Y,Z)a:T[fi + N{Z)], 

where N{Z) = (n, , ? • • •) T (v) = T(v^ ). A predictive updating Gibbs 

sampler (Liu 1994; Chen and Liu 1996) can be implemented, which iteratively up- 
dates each individual’s phase by a sample drawn from the distribution 

P[z. =(g,/t)|Z[^,.],7]oc(ng+^^)(n, (*) 

where g® h = . An unsatisfactory aspect of this approach is that the simple 

Dirichlet prior cannot capture the biological reality and often introduces artifacts 
because of the excess number of all possible haplotypes. To minimize the effect of 
the prior, Niu et al. (2002) employs a prior annealing approach, which lets each Pj 
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converge to zero along with the iteration. Thus, their algorithm in fact samples from 
the distribution 

P(7,Z)^r[iV(Z)] = n„^^/n^.-l)! 

The optimal Z maximizing the above function is often used as the phase prediction 
of each individual. In contrast, Clark’s algorithm phases the individuals so as to 
minimize #{/: «y>0}, which can be seen as a crude approximation to the above poste- 
rior mode. 

To reflect the relationship among real haplotypes, the PHASE algorithm (Stephens 
et al. 2001) attempts to incorporate the coalescence theory into the prior. Since it is 
technically difficult to work with such a prior, PHASE modifies the predictive updat- 
ing fomiula (*) to build in the prior knowledge. However, the result from the algo- 
rithm is no longer a valid posterior distribution. Nevertheless, PHASE appears to 
perform well under the coalescent model-based simulations. The algorithm can also 
handle a large number of SNP markers, although with a much prolonged running time. 

To overcome the size limitation of the canonical EM algorithm and the slow con- 
vergence of PHASE, Niu et al. (2002) introduced the PL idea, which allows one to 
phase individuals with a large number of SNPs rapidly, and implemented it with the 
Gibbs sampler in a program called HAPLOTYPER. More specifically, this algorithm 
divides the entire haplotype into an ordered set of smaller blocks. After inferring the 
phases for each block, the final full haplotype phase is derived by piecing together the 
partial solutions recursively. Its performance is further enhanced by the prior anneal- 
ing strategy. Qin et al. (2002) implemented an algorithm called PL-EM, which ap- 
plied the PL strategy to the EM algorithm. To avoid throwing away partial haplotypes 
prematurely, PL-EM adopted a backup-buffering strategy to keep a selected set of 
less-than-optimal solutions of partial haplotypes. Furthermore, it is capable of vari- 
ance calculation for estimated haplotype frequencies and can cope with missing geno- 
type data. With the use of parallel computers, phasing at the leaf-node level of the 
hierarchical ligation can be done in a concurrent and distributive fashion, which can 
further improve the computational efficiency of PL-based haplotype phasing algo- 
rithms. 

In Table 1, we applied both HAPLOTYPER and PL-EM to phase the 14 SNPs 
(spanning from SNP479 to SNP988) on the apolipoprotein E (APOE) gene for 93 
cases with Alzheimer disease and 105 disease-free controls, respectively (Martin et al. 
2000 ). 

Using both HAPLOTYPER and PL-EM phasing results, we applied a block- 
partitioning method based on the dynamic programming algorithm (Zhang et al. 2002) 
among the cases and controls respectively. The program found that the haplotypes 
could be partitioned into 3 blocks among the cases (the first block contains only the 
first SNP, SNP982; the second block spans 5 SNPs from SNP987 to SNP874; the 
third block consists of 8 SNPs from SNP992 to SNP988), and 4 blocks among the 
controls (the first block consists of the first 5 SNPs spanning from SNP982 to 
SNP465, the second block contains only SNP874 itself; the third block contains 6 
SNPs from SNP992 to SNP888; and the rest 2 SNPs, SNP877 and SNP988 are lo- 
cated in the fourth block). Comparing the block stmctures of the cases and controls, 
we can see that the data suggest a recombination hotspot between SNP874 and 
SNP992 (the same breakpoint that separates blocks #2 and #3 in both cases and con- 
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trols), which happens to coincide with the boundary between the TOM40 gene (lo- 
cated in an intron of APOE) and its neighboring APOE exon. These results are in 
agreement with pairwise ED findings (data not shown). Thus, a haplotype block- 
based association test can be performed using tagging SNPs. 



Table 1. Haplotype analyses of the data on 14 SNPs (spanning SNP982-SNP988; Martin et al. 
2000) using HAPLOTYPER and PL-EM in Alzheimer disease cases and controls, respectively. 
Subjects with missing data > 30% were removed. HAPLOTYPER and PL-EM found 62 and 60 
haplotypes in controls, and 47 and 49 haplotypes in cases, respectively. Only major haplotypes 
with frequencies >5% in either cases or controls were shown 



(a) HAPLOTYPER Results* 



Haplotype 


Haplotype 

Configuration 


Controls (A=105) 
Count % 


Cases (A=93) 
Count % 


HI 


22222221122122 


18 


8.57143 


16.0 


8.60215 


H2 


12112211212222 


16 


7.61905 


5.0 


2.68817 


H3 


22122221122122 


14 


6.66667 


5.0 


2.68817 


H4 


22212211212222 


12 


5.71429 


13.0 


6.98925 


H5 


12121211212222 


11 


5.23810 


0 


0 


H6 


22121212222121 


9 


4.28571 


25.0 


13.44086 


H7 


22121211212222 


4 


1.90476 


10.0 


5.37634 


H8 


12121212222121 


0 


0 


16.0 


8.60215 


(b) PL-EM Results 


Haplotype Haplotype 


Controls (A=105) 


Cases (A=93) 




(14 loci) 


Freq. 


Std. Err. 


Freq. 


Std. Err. 


HI 


22222221122122 


0.0992 


0.0213 


0.0713 


0.0196 


H2 


12112211212222 


0.0897 


0.0200 


0.0520 


0.0170 


H3 


22122221122122 


0.0574 


0.0167 


0.0409 


0.0157 


H4 


12121211212222 


0.0563 


0.0163 


0.0270 


0.0127 


H5 


22212211212222 


0.0536 


0.0159 


0.0376 


0.0140 


H6 


22121212222121 


0.0535 


0.0160 


0.1674 


0.0280 


H7 


22221212222121 


0.0099 


0.0078 


0.0651 


0.0191 



3 Practical Issues with Haplotype Inference 

Concurrent with the Human Genome Project, a variety of high-throughput genotyp- 
ing technologies have been developed, such as the 5’ nuclease assay (TaqMan®), 
oligonucleotide ligation assay. Invader assay, and Sequenom’s matrix assisted laser 
desorption/ionization time-of-flight mass spectrometry assay (reviewed in Tsuchi- 
hashi and Dracopoli 2002). The majority of genotyping calls were automatically 
generated from the machine readouts, or by manual checking in case of uncertainties. 
Recent studies suggest that even slight amounts of genotyping error can significantly 
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impact on LD estimation (Akey et al. 2001) and haplotype phasing accuracy (Kirk 
and Cardon 2002). We addressed the issue of genotyping uncertainty that gives rise to 
errors using a novel algorithm (see Section 3.1). Furthermore, since haplotype infer- 
ence by genotyping individuals separately is shown to be both low-throughput and 
resource-intensive, DNA pooling strategy has emerged as a new paradigm in gene- 
mapping methodology (Sham 2001) that can dramatically improve efficiency and 
lower the costs, especially at the SNP screening phase (Bansal et al. 2002). We will 
discuss a new statistical model and its implied algorithm in this case (Section 3.2). 



3.1 Haplotype Inference with Genotype Uncertainty 



Conventionally, in cases when the raw genotyping machine output of a genotype is 
ambiguous (i.e. unclear to be scored as a heterozygote or a homozygote), either a 
genotype call is made based on human judgments or the genotype is treated as com- 
pletely missing. Kang et al. (2003) introduced a Bayesian clustering algorithm based 
on the mixture bivariate t-distribution. The algorithm finds the center and spread of 
each cluster using the parameter-expanded data augmentation scheme (Liu and Wu 
1999; van Dyk and Meng 2001) and assigns each point probabilities that it belongs to 
all possible clusters. An EM algorithm is then used to incorporate these genotype 
probabilities for haplotype inference. This algorithm generalizes the previous EM- 
based algorithm (Qin et al. 2002) by treating each person’s genotype as a mixture of 
multiple possible genotype, referred to as genotype spectrum. Consider the collection 

of genotype spectrum {Y'\ IT*), for / = , with Y' = (f,' , ) be- 
ing the set of possible genotypes with respective probabilities 11“ = {7t[ , ) , 

where vn^ is the number of possible distinct genotypes and theTT* ’s sum to one. The 



likelihood function of the model is 



/>(r|0,n)=n/>(r|0,n) = n If Z 



/=1 



1=1 yj=i {g,h):g@h=Y‘- J 



The MLE for 0 = {6^, is computed by the following EM iterations: 

• Updating the haplotype frequency 0 = {6^,..., 6^): 






gd+1) = ^ ^ ^ 



Zf Z 

V=1 (g' ,h')-.g'®h'=Y] 



where \ g denotes the complement haplotype that paired together with g 



to make up the genotype 
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(Optional) Updating the probability spectrum II = (11^, . . . , II" ) : 



^.,(0 ^ 



«+i) 

h 



7t 






h=Y] 



j=\ gmh=Y] 



where denotes the probability for the /*' possible genotype in its spectrum for 
the f' individual at iteration t . 

After convergence, an individual’s haplotype phases is determined by the compati- 
ble haplotype pair ig,h) that maximize 7t‘j0^0i^ . Our preliminary analyses showed 



that by incorporating genotyping uncertainties, we can improve both haplotype infer- 
ence accuracy and the power in case-control studies significantly (Kang et al. 2003). 



3.2 Haplotype Inference Using Pooling Data 

Several groups have developed haplotype phasing schemes that deal with genotype 
data generated from DNA pools (Ito et al. 2003; Pe’er and Beckmann 2003; Wang et 
al. 2003). Following the formulations and notations in Pe’er and Beckmann (2003), 
we observe the counts, or equivalently, the frequencies of the “1 ’’-allele (“0”- or “1”- 
alleles are arbitrarily defined) for each SNP marker in a given haplotype block across 
several populations. The goal is to reconstruct optimal haplotype block and estimate 
their frequencies within each sample. In summary, denoting binary digits (0 or 1) by 
B , and probabilities ([0,1] interval) by P , we have the following problem: 

Input: A matrix of allele frequencies in S SNPs over P populations, 

M e 

Goal: Find the matrix of H common haplotypes B G coded by 0 or 1; 

Find the matrix of the frequencies oiH common haplotypes within each 
population, F E such that E = FB is “closest to M . 

Pe’er and Beckmann (2003) defined a measure of the noisiness of M with respect 
to E and developed an algorithm that minimizes: 

Noise(P,M) = maXp^ 

where e and m represent the elements of matrix E and M respectively. This defini- 
tion is based on the assumption that the sampling error is small enough to be ignor- 
able. 

To enable a likelihood approach, we assume that , the observed count of “1”- 
allele for SNP location S and population p , follows the Binomial distribution 
Bin(np,epj,) , where rip is the size of samples from population p . The likelihood 
function is given by: 



max 


^p.s 


I- 


^ps 


m 


I- 


m 
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S=1 

which allows us to use the maximum likelihood estimation method to estimate E. We 
have tested both the new method and the method by Pe’er and Beckmann, and the 
two methods produced similar results in most case. More comparisons of two algo- 
rithms are in progress. Finally, it is noted that the PL strategy presented in Section II 
can be integrated into the above algorithms. 




Fig. 1. The histograms of sampled positions and the single marker measurements. We used 30 
SNPs markers around APOE-4 (in Martin et al. 2000) to test the disease mapping ability of 
BLADE algorithm. The SNPs under consideration span a region of 615kb. The APOE-4 (i.e. 
SNP528) is located 425kb from the left-most marker, SNP479. The physical distances were 
translated into genetic distances by assuming 1Mb = IcM. The blue bars denote the histogram 
of positions sampled by BLADE. The circled line indicates the single marker measurements. 
The three vertical lines denote the mean position and the 95%PI bounds, respectively. The zero 
point of x-axis is set to be the position of APOE-4, and the distal direction to be positive, (a) 
haplotype uncertainty was modeled jointly with LD tests in BLADE; (b) haplotype phasing 
was performed first using PL-EM, and LD testing was done subsequently. 



4 LD Mapping with Genotype and Haplotype Information 

An important application of haplotype information is the fine-mapping of disease- 
related mutations. It has been observed in many cases that a large portion of the carri- 
ers of the disease gene in the current population are descendants from a small number 
of “founders” . Since LD exploits allelic relationships of neighboring markers, this 
simple approach has been shown to be more effective than the linkage approach in 
positional cloning of complex, quantitative traits such as taste sensitivity to phenyl- 
thiocarbamide (Kim et al. 2003), asthma (Van Eerdewegh et al. 2002), and type-2 
diabetes (Horikawa et al. 2000). 

Traditional pairwise LD-based approaches (Kerem et al. 1989; Hastbacka et 
al. 1992; Ozelius et al. 1992) suffer from limitations that become evident when the 
information content of the data is low, for example, when missing data, multiple 
founders, and unphased chromosomes, are present. Additionally, pairwise disequilib- 
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rium measures are not robust to variations in marker allele frequeneies, multiple 
founder mutations, or varying mutation rates at markers. Likelihood-based methods 
address some of these problems by making speeifie assumptions on the evolutionary 
history of the disease haplotypes (Kaplan et a/. 1995; Xiong and Guo 1997; Graham 
and Thompson 1998; Rannala and Slatkin 1998). But these methods require assump- 
tions on many exogenous parameters whose values are unknown, and the likelihood 
funetions on whieh they are based are often too eomplex. Several truly haplotype- 
based LD mapping methods are based on hidden Markov modeling (Lam et al. 2000; 
Liu et al. 2001; MePeek and Strahs 1999; Morris et al. 2000). Although they do not 
explieitly model the evolutionary relationship of the haplotypes, their results have 
shown some promises. Here we foeus on the use of the BLADE algorithm developed 
in Liu et al. (2001) in eombination with haplotype inferenee strategies PL-EM. 



4.1 Haplotype-Based Inference of Disease Locns Estimation 

BLADE is eurrently the only method that allows for multiple aneestral haplotypes, 
and is well suited for the fine mapping of a disease gene within a previously identi- 
fied linked region. BLADE is also flexible in treating various eomplieations sueh as 
missing marker data, multiple founders, and unphased ehromosomes. For example, 
the algorithm provides not only the estimation of the mutation loeation but also the 
haplotype eonstruetion in the ease when part or all of the disease ehromosomes are 
unphased. 

In this model, a haplotype in the ease group is either a “disease haplotype,” a de- 
seendent of one of the aneestral founders, or a phenoeopy. A disease haplotype dif- 
fers from its founder beeause of the reeombination events near the disease loeus and 
point mutations. The haplotypes in the eontrol group ean be treated either as in equi- 
librium or following a Markov model, with the latter model espeeially appropriate 
when densely spaeed SNPs are eonsidered (Liu et al. 2001). An EM algorithm ean be 
used to treat unphased eontrol ehromosomes under the Markov model (Lu et al. 
2003). 

Many previous authors have diseussed the issue of dependent disease haplotypes 
due to the underlying genealogy. As stated in Rannala and Slatkin (2000), the use of a 
star genealogy “undoubtedly involves a tradeoff of statistieal aeeuraey and effieieney 
for mathematieal simplieity and rapid eomputability.” Liu et al. (2001) treated the 
defieieney of the star genealogy by allowing for multiple aneestral elusters and by 
haplotype purging (as implemented in their eystie fibrosis example). Haplotype- 
speeifie weighting sehemes similar to the one used for protein sequenee analysis 
eould be another promising route. 



4.2 Integration of Haplotype Phasing into LD Testing 

To identify the eorreet loeation of the disease-related mutation among a set of tightly 
linked markers, it is of interest to assess how mueh the haplotype information adds to 
the improvement of the aeeuraey. Lu et al. (2003) showed that haplotype inferenee 
and the loeation estimation ean be aehieved at the same time via a joint statistieal 
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model if a simple Mendelian disease is under eonsideration. But in the eomplex dis- 
ease ease, it is often helpful to eonduet haplotype phasing before fine-mapping. To 
illustrate these points, two strategies for loeating the disease mutations based on un- 
phased disease genotypes were eonsidered: (a) integrating haplotype phasing with LD 
mapping; i.e., applying BLADE directly to the unphased individuals; and (b) per- 
forming haplotype phasing by PL-EM and then applying BLADE to the inferred 
haplotypes. We compared the perfomiances of (a) and (b) in two examples: (1) cystic 
fibrosis; (2) Alzheimer disease. 

(1) Cystic Fibrosis 

Cystic fibrosis is a Mendelian trait caused by mutations at the CFTR gene. By 
comparing strategies (a) and (b) on 100 permuted cystic fibrosis datasets and another 
100 simulated datasets, we observed that strategy (a) is superior to strategy (b) ac- 
cording to the root mean square error between the estimated location and the true 
location and the percentage of times (in the 100 simulations) that the 95% probability 
interval (PI) overlapped with the target region, no matter which phasing algorithm 
(HAPLOTYPER, Niu et al. 2002; PL-EM, Qin et al. 2002; PHASE, Stephens et al. 
2001; HAPINFERX, Clark 1990) or LD mapping algorithm (BLADE, Liu et al. 2001; 
DHSmap, McPeek and Strahs 1999) was used. This observation supports our hy- 
pothesis that the Mendelian trait itself is a useful piece of information in haplotype 
inference. Thus, a joint model for both disease-gene locus and disease haplotypes is 
more favored in this case. 

(2) Alzheimer Disease 

Alzheimer disease is a complex disorder, and APOE is a well-established suscepti- 
bility gene based on previous reports. Martin et al. (2000) genotyped 60 SNPs in a 
3.5-Mb region surrounding APOE in samples of unrelated cases and controls. Single- 
SNP analyses found evidence of association (P<.05) for 7 of 13 SNPs, including the 
APOE-4 polymorphism, spanning 40 kb on either side of APOE. As expected, the 
strongest evidence for association with AD was seen for the APOE-4 polymorphism, 
as well as for two other SNPs that he <16 kb from APOE. We re-analyzed the APOE 
SNP data using BLADE (Lu et al. 2003) directly. Our results (Figure 1) showed that 
BLADE can handle complex traits, but its performance decreases as the complexity 
of the disease phenotype increases. In contrast, strategy (b) gave a more accurate 
estimate and a much narrower 95% PI for the location of the disease mutation. Be- 
cause of the polygenic nature and the susceptibility of environmental factors of a 
complex disease, the number of disease haplotypes in the case group can be very 
small (~ 20% in the APOE case). As a result, no significant gain is expected in locus 
estimation by combining haplotype uncertainty with LD mapping, and a joint model 
often adds to the complexity of the problem and leads to an inferior result. Indeed, 
inferring haplotype phase first and then performing LD mapping [i.e. strategy (b)] 
may have both computational and inferential advantages. 



5 Discussion 

We have presented a review of some recent work in haplotype inference and their 
applications in disease gene mapping. All the evidence seems to suggest that using 
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explicit statistical models in various stages of genetic analyses is a fruitful strategy 
and can often lead to near-optimum utilization of available genetic information. An- 
other advantage of explicit statistical modeling is that one can criticize and improve 
upon the existing models so as to better analyze the available data. An important issue 
in such analyses is how much simplification we want to impose and how this 
simplification fits or differs from reality. For example, the EM algorithm or the 
Dirichlet-prior based Bayesian haplotype inference methods do not impose any rela- 
tionship among the human haplotypes, which is clearly unrealistic since the 
evolutionary theory dictates that we all descend from a common ancestor. Stephens et 
al. (2001) propose a good heuristic approach to incorporate some evolutionary 
information. It is also possible to prescribe a “metric” structure on the space of 
haplotypes more directly. 

The current BLADE algorithm uses either equilibrium or Markov model to de- 
scribe the control haplotypes and assumes an exponential decay of ED among the 
disease haplotypes. This may not be sufficient in light of recent discussions of haplo- 
type blocks. It is perhaps desirable to model the combination of ED decay with the 
block-type haplotype structure. A simple way to achieve this is to allow for inhomo- 
geneous conversion rates between physical and genetic distances. These important 
issues deserve more attention and efforts from statistical geneticists. 
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Abstract. There has been considerable recent interest in the use of 
haplotype structure to aid in the design and analysis of case-control as- 
sociation studies searching for genetic predictors of human disease. The 
use of haplotype structure is based on the premise that genetic variations 
that are physically close on the genome will often be predictive of one 
another due to their frequent descent intact through recent evolution. 
Understanding these correlations between sites should make it possible 
to minimize the amount of redundant information gathered through as- 
says or examined in association tests, improving the power and reducing 
the cost of the studies. In this work, we evaluate the potential value 
of haplotype structure in this context by applying it to two key sub- 
problems: inferring hidden polymorphic sites in partial haploid sequences 
and choosing subsets of variants that optimally capture the information 
content of the full set of sequences. We develop methods for these ap- 
proaches based on a prior method we developed for predicting piece- wise 
shared ancestry of haploid sequences. We apply these methods to a case 
study of two genetic regions with very different levels of sequence di- 
versity. We conclude that haplotype correlations do have considerable 
potential for these problems, but that the degree to which they are use- 
ful will be strongly dependent on the population sizes available and the 
specifics of the genetic regions examined. 



1 Introduction 

Since the release of draft consensus human genome sequences [8, 24], much at- 
tention has turned to studying the variations in the genome that distinguish one 
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person from another. These variation occur predominantly in the form of single 
nucleotide polymorphisms (SNPs) at which a single DNA base pair has two com- 
mon variants in the population. It is hoped that characterizing these variations 
and correlating them with phenotype through case-control association studies 
can assist in locating genes or specific genetic variants that influence human dis- 
ease risk. Association-based studies are likely to allow much finer-scale mapping 
than was possible with the traditional pedigree-based linkage studies (See, for 
example, Cardon and Bell [2]). Furthermore, they are potentially much better 
suited to tracking down genetic influences on common, complex diseases, which 
are believed to have substantial genetic components but for which these compo- 
nents are obscured by strong environmental influences and the likely interaction 
of many distinct genetic factors [19]. Despite some successes [13], however, these 
benefits so far remain largely hypothetical. 

One leading approach to improving the power of these studies is to rely on 
correlations between physically close SNPs by grouping SNPs into conserved 
haplotypes. By performing association studies on these haplotypes instead of 
individual SNPs, we can greatly reduce the number of distinct hypotheses be- 
ing considered in an association study, allowing us to lower our standards of 
proof for each hypothesis and thereby increase the power of the study. Some sta- 
tistical approaches have attempted to incorporate haplotype structure directly 
into association tests for this purpose [14, 21, 15, 12]. An alternative strategy 
is to characterize the haplotype structure in a recombining population indepen- 
dently of its application to a particular association test, a strategy that has been 
pursued from many directions [11, 7, 26, 27, 25, 28, 3, 20, 23]. 

One widely adopted approach to the direct characterization of haplotype 
structure stems from a seminal study by Daly et al. [4], which suggested that 
the human genome could be decomposed into segments of low haplotype diver- 
sity separated by regions inferred to be frequent sites of recombination. Locating 
the “haplotype blocks” of low diversity would allow one to reduce the complex- 
ity of the inference problem by working with haplotype block alleles instead of 
individual polymorphic sites. In addition, the use of “haplotype tagging” SNPs, 
which are subsets of SNPs in a block adequate to characterize a large fraction 
of its population diversity, could potentially significantly reduce the cost of con- 
ducting association studies without substantially hurting their power [10, 29]. 
A variety of methods have been proposed for defining optimal block decomposi- 
tions [4, 10, 18, 9, 6]. Optimal block decompositions can be efficiently computed 
for a broad class of objective functions, a task that can also simultaneously yield 
minimal informative SNP subsets for the given block decomposition [30]. Block 
decompositions also yield straightforward algorithms for inferring missing sites 
based on best-matching alleles within each block. 

While these discrete block decompositions are algorithmically convenient, 
though, they do not capture all of the available information that might be use- 
ful for downstream analyses. There are often correlations between consecutive 
blocks [6] , suggesting the presence of longer-range information than is captured 
by the block decompositions. Similarly, there may be finer structure the decom- 
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positions obscure. These additional sources of information can be expected to be 
much more pronounced when rarer alleles are included than in the Daly et al. 
and Gabriel et al. studies, as may be required for fine-scale mapping of genetic 
influences on common diseases. These factors argue for examining the feasibility 
of approaching the downstream problems using haplotype inferences that do not 
rely on discrete block decompositions. 

In this work, we assess the value of block- free predictions of shared haplo- 
type ancestry in characterizing the information content of haploid sequences. 
We accomplish this by applying a prior method of ours for block-free piecewise 
inference of ancestry [20] to two key sub-problems in the design and analysis of 
inference studies: inferring missing data and choosing informative SNP subsets. 
Inferring missing data based on sequence context provides a good test of our 
general ability to detect and apply statistical correlations found through hap- 
lotype context information. Locating reduced subsets of SNPs that allow us to 
characterize the remaining missing sites with high accuracy gives us an approx- 
imate idea of how much we can hope to reduce assay sizes while still adequately 
capturing the available sequence diversity. We develop simple methods for both 
problems and evaluate their performance using cross-validation studies of two 
real genetic datasets. We conclude that haplotype data provides a considerable 
amount of information usable for such problems but that there may be signif- 
icant limits to what can be accomplished given reasonable sizes of population 
samples, depending on the specific genetic region examined. 



2 Methods 

2.1 Predicting Sequence Ancestry 

We predict sequence ancestry using a method introduced in Schwartz et al. [20]. 
That work presented the problem of ancestry inference in terms of what we call 
“haplotype coloring,” coloring each site of a sequence so as to indicate from 
which of a set of ancestral sequences it is most likely to have descended at each 
polymorphic site. We presented two methods for the problem. The first, which 
was simultaneously introduced by Ukkonen [23], predicts ancestry by building 
on block decompositions. That method first finds a block decomposition for a 
set of sequences, then joins alleles in adjacent blocks using a maximum matching 
algorithm. The second method, which forms the basis of the present work, uses 
a restricted hidden Markov model (HMM) to represent the possible ways of 
combining ancestral sequences to yield a modern population. 

The basis of the HMM coloring method is optimization of a function ex- 
pressing the probability of generating an observed sequence and coloring given 
site-specific frequencies for ancestral haplotypes. To generate a modern sequence 
under this model, we first sample among all possible ancestral sequences, each 
of which has a characteristic starting probability. At each subsequent site, we 
either continue with the current sequence or undergo a recombination event, 
with a global uniform probability of recombination. If a recombination event 
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occurs between two sites, then we resample among all potential ancestral se- 
quences according to the characteristic site-specific frequency for each sequence 
to chose the ancestor at the next site. We also allow, at each polymorphic site, 
a uniform mutation probability of the generated sequence differing from its pre- 
dicted ancestor at that site. This model is similar to that used by Stephens et 
al. [22] for the related problem of haplotype phase inference, although it differs 
in the use of a global recombination probability and site- and sequence-specific 
ancestral haplotype frequencies. These changes are intended to reduce the data 
dependence of the learning methods by reducing the number of parameters to 
be inferred from quadratic to linear in the number of sequences, a change that 
comes at a cost in generality of the model. The log of the probability implied by 
this model, normalized by a factor independent of the coloring and frequencies, 
yields the following objective function: 

n 

G = //ii + ^ mD{shj , (Jj) + 

i=i 

n 

+ r)D{hj,h,.i) + log{{l - e") + ~ 

i=2 



where 

— n is the number of polymorphic sites per sequence 

— fij is the frequency with which haplotype i is chosen following a recombina- 
tion between sets j — 1 and j. 

— hj is the color assigned to site j of the target sequence 

— Sij is allele value of site j of reference sequence i 

— <jj is the allele value of site j of the target sequence 

~ TO is the log prior probability of mutation at any site 

~ r is the log prior probability of recombination between any two sites 

— D{a, 6) is 0 if a = 6 and 1 if a yf & 

In the above equation, the first term is the contribution of the probability of 
starting with particular color. The second term is the sum of mutation contri- 
butions accounting for errors in matching the predicted ancestors. The third is 
a sum of the log probabilities of two possible events at each site: choosing a new 
haplotype following a recombination event or sticking with the prior haplotype 
either because there was no recombination or because there was a recombination 
with a sequence sharing the same common ancestor at that site. The set of values 
hi . . .hn maximizing G is the maximum probability coloring for a single target 
sequence. Because sequences are colored independently of one another, finding 
the optimal coloring for each target sequence will yield the globally optimal col- 
oring for all sequences for a given set of frequencies. We solve for this objective 
by a Viterbi-like dynamic programming algorithm in which, for each site j and 
ancestral sequence k, we find the optimal coloring of a target sequence on sites 
1 through j terminating with color k. 
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We determine likely ancestral sequences from a modern population and es- 
tablish their site-specihc frequencies by an expectation-maximization algorithm 
[1]. The algorithm takes as input a single reference population and hnds a lo- 
cally optimal set of frequencies for that population and coloring in terms of those 
frequencies. Optimal coloring is achieved given a set of frequencies by applying 
the above dynamic programming algorithm for each sequence independently. Lo- 
cally optimal frequencies are derived at each site by a steepest descent search 
for frequency values maximizing the contribution to G at that site. We initialize 
the method by assuming that each sequence in the data set is ancestral and 
that its frequency at all sites is equal to the measured frequency of the full se- 
quence in the modern population. We then repeatedly apply the discrete coloring 
algorithm followed by the continuous frequency optimization until the method 
converges on a solution. This method is similar to the standard Baum- Welch 
method, although it is complicated by the interdependencies between transition 
probabilities introduced by our attempts to reduce the data dependence of the 
method. For more details on the haplotype coloring models and algorithms, see 
Schwartz et al. [20] . 

2.2 Missing Data Inference 

We can adapt the block-free HMM method straightforwardly to the problem 
of missing data inference with a minor modification of the probability model 
allowing for the scoring of unknown sites. We treat an unknown site value as 
a match to any allele at that site, thus eliminating the potential for mismatch 
penalties at the unknown site. This model is equivalent to assuming that all 
known alleles are equally likely at an unknown site, and thus each possible match 
incurs the same penalty. This assumption allows us to color sequences with 
unknown values using the dynamic programming algorithm described above, 
with only the change that D{shj,o'j) = 0 if Shj or aj is unknown. This change 
has no effect on the asymptotic run time of the coloring algorithm. 

Given the coloring of an unknown target sequence in terms of a reference 
population, we can then fill in likely missing values. For a polymorphic site of 
unknown allele value that has been assigned a color, we predict that the true 
allele value is the most frequent one among all known alleles assigned the same 
color in the reference population used in the EM frequency estimation. It might 
be more sound in terms of the model to chose the allele corresponding to the 
predicted ancestral sequence at the polymorphic site, even if it is not the most 
common allele for its predicted descendants. We chose the most common value, 
however, to allow for the possibility that the reference ancestral sequences may 
themselves have unknown values which can nonetheless be inferred based on 
their predicted descendants. 

2.3 Informative SNP Selection 

While much of the field has focused on SNP selection by blocks, creating an 
algorithmically convenient framework at the cost of some loss in usable infor- 
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mation content, we choose in this work to deal with the more general block-free 
framework and accept a cost in our ability to solve optimally and efficiently for 
the informative SNP selection problem. This decision gives us the freedom to 
work with a difficult objective function that closely matches a reasonable mea- 
sure of the value of the chosen SNP set. We define the optimal SNP set of a 
given size to be the SNP set that maximizes the number of sites outside that 
set that would be predicted correctly in our training data based only on sites in 
the chosen SNP set, using our coloring-based missing data inference algorithm 
of section 2.2. 

As we do not have an exact algorithm to find an optimal SNP set for our 
objective function, we worked with a simple heuristic method using a genetic 
algorithm. The hallmark of a genetic algorithm is a method for “mating” estab- 
lished solutions to produce new solutions that may yield better answers. Our 
mating method proceeds as follows: 

1. Choose two solutions at random. Si and S' 2 , presumed to each contain k 
SNPs. 

2. Chose a random crossing SNP site j. 

3. Construct a new solution S 3 = {s|(s < j A s G Si) V (s > j A s G S' 2 )} 

4. If I S3 1 < A: then add k— IS3I SNPs chosen uniformly at random from among 
all s ^ S3 to S3. 

5. If I S3 1 > k then remove k — IS3I SNPs chosen uniformly at random from 
among all s G S3 from S3. 

6. Return S3 as a new candidate solution. 

The mating method forms the core of an algorithm for heuristically generat- 
ing and testing possible solutions to attempt to find an optimal or near-optimal 
choice. We begin the overall algorithm by constructing an initial set of 20 candi- 
date solutions by choosing 20 SNP sets uniformly at random among all possible 
subsets of k our n SNPs. We then assign a score to each candidate solution by 
hiding all sites not in the candidate in our training set, predicting the hidden 
sites using the algorithm of section 2.2, and counting the fraction of hidden sites 
correctly predicted from the training data. Among the 20 ranked SNP sets, we 
keep the top half and produce an equal number of new sets by mating the 10 
old sets we keep, yielding 10 old and 10 new sets. We repeat this process for 100 
rounds for each value of k, finally returning the highest-scoring SNP set in the 
final round as our approximate optimum. 

While the use of a genetic algorithm for a computational genetics problem 
may seem an odd choice, we would in fact expect the model of meiotic mutation 
and recombination used by a genetic algorithm to be a good match to the nature 
of information locality in actual recombining sequences. As a result, the genetic 
algorithm appears likely to be a reasonable choice as heuristic methods go. 

3 Results 

We evaluated the methods using two real datasets: a set haploid sequences from 
22 biallelic variations (21 SNPs and one two-base deletion polymorphism) in 
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Fig. 1. Colorings of APOE and LPL datasets. In each image, rows of color 
represent distinct sequences and columns represent distinct polymorphic sites. 
Like colors within an image represent an inference of descent from a common 
ancestral sequence. Colors do not in general have any correspondence between 
images. A: coloring of the full APOE dataset. B: coloring of the full LPL dataset. 
C: coloring of the APOE training set. D: coloring of the LPL training set. 



192 chromosomes for the apolipoprotein E (APOE) gene [16, 5] and a set of 
computationally inferred haplotypes from 71 SNPs in 142 chromosomes for the 
lipoprotein lipase (LPL) gene [17]. 

We first established colorings for the two datasets. For both data sets, we 
used a mutation parameter of -4. For APOE we used a recombination parameter 
of -1 and for LPL a recombination parameter of -0.5, selected empirically based 
on a visual analysis of the colorings they yielded. Figures lA and IB show the 
colorings yielded for these parameters for each data set. To evaluate the method, 
we further randomly split each data set into two equal-sized subsets of chromo- 
somes: a training set and a testing set. Figures 1C and ID show the coloring 
derived for the same parameter sets using just the training data. For APOE, 
the training data alone yields the major features of population-wide haplotype 
structure found in the full data set. For LPL, some noticeable structural ele- 
ments from the full population are detected from the training data alone, while 
others are missed. 

We next used the data sets to evaluate the ability of the coloring methods to 
perform missing site inference. For each fraction of hidden sites, in increments of 
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5%, we created an artificial data set from the testing data by creating ten copies 
of each sequence in the testing data and independently at random hiding the 
chosen fraction of hidden sites in each sequence. Finally, we applied the block- 
free coloring method to infer an HMM on the training set and used this HMM 
to color the artificial testing set sequences and fill in hidden sites based on the 
coloring. Finally, we evaluated the accuracy of the predictions compared to the 
true values of the hidden sites in the testing data. 

Figure 2 shows the effectiveness of the methods on the two data sets data. We 
used two measures of quality: accuracy per base in predicting individual hidden 
sites and accuracy per sequence in predicting all hidden sites in a given sequence 
correctly, each as a function of the fraction of sites hidden. For APOE, per- 
base accuracy remains consistently high for all numbers of hidden sites, largely 
reflecting the fact that many polymorphisms have relatively rare minor variants 
in this data set. Per-sequence accuracy, however, shows a gradual decline as a 
greater fraction of sites are hidden. LPL shows a more nuanced profile, consistent 
with its greater diversity in the population. Per-base accuracy shows a slight 
decline with the fraction of sites hidden until reaching a plateau of about 80% 
accuracy at about 65% of sites hidden. Full-sequence accuracy falls off far more 
steeply, dropping rapidly up to about 20% hidden sites, then more gradually 
decreasing to zero. 

We evaluated the effectiveness of the informative SNP selection method with 
a similar protocol to that used for missing data inference, judging the ability of 
the methods to infer hidden sites when those sites are computationally chosen. 
We used the same partition of APOE and LPL data sets into testing and training 
sets and the same program parameters as with missing site prediction. Instead 
of randomly hiding sites, however, we used the SNP selection method described 
in section 2.3 to choose the sites to be hidden based on the training data. We 
then evaluated our ability to predict the hidden sites using the testing data. We 
repeated this analysis for each possible number of hidden SNPs sites (0-22 for 
APOE and 0-71 for LPL). 

Figure 3 shows the results of the missing site prediction using informative 
SNP selection for the two data sets. The graphs show a substantial improvement 
in accuracy compared to the results for randomly hidden sites. In both cases, the 
gradual decay approaching a plateau seen with per-sequence accuracy for random 
data is replaced by a slow decay at lower numbers of hidden sites followed by an 
increasingly steep decline for larger numbers of hidden sites. Per-site accuracy 
is also substantially improved for both until almost all sites are hidden. The 
benefits of selected versus random hidden sites is more pronounced for APOE 
than for LPL. In APOE, the accuracy drops minimally until 80% of sites are 
hidden, then falls sharply. For LPL, there is a continual decay, but it is much 
slower than when random hidden sites are chosen. The jerkiness of the LPL 
graph compared to the APOE graph appears primarily to reflect the fact that 
LPL’s training set is less predictive of the haplotype patterns in its testing set 
than is the case with APOE. 




70 



R. Schwartz, A.G. Clark, and S. Istrail 



Prediction Accuracy on APOE with Random Hidden Sites 



\ 



per-site accuracy - 
per-sequence accuracy 



A 



percent sites hidden 



B 



Prediction Accuracy on LPL with Random Hidden Sites 




percent sites hidden 



Fig. 2. Results of missing data inference. Graphs show accuracy in predicting 
individual sites and complete sequences as a function of the fraction of hidden 
sites for each data set when sites are hidden independently at random. A: results 
on APOE. B: results on LPL. 
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Fig. 3. Results of SNP selection. Graphs show accuracy in predicting individual 
sites and complete sequences as a function of the fraction of hidden sites for 
each data set when hidden sites are computationally selected to optimize per- 
site accuracy in the training data. A: results on APOE. B: results on LPL. 



4 Discussion 

We have presented computational methods for approaching some key problems 
in understanding haploid genetic data and applying it to the design and anal- 
ysis of association studies. We showed how a previously developed method for 
piece-wise ancestry prediction could be adapted to the problems of inferring 
missing data and choosing informative SNP subsets. We further showed results 
of the ancestry prediction itself and its application to the two problems for two 
gene regions. These results indicate that there is substantial redundant informa- 
tion contained in sequences of polymorphic sites that we can exploit by using 
haplotype sequence context. They also suggest, though, that the value of such 
methods is likely to vary significantly between different genetic regions and may 
be severely limited for some regions if population samples are of inadequate size. 
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We can draw several general conclusions from our results about the prospects 
for haplotype-based studies. The comparison of randomly hidden to selectively 
hidden SNPs indicates that the right choice of SNPs can substantially improve 
our ability to infer the hidden sites, lending support to the idea of choosing 
haplotype tagging SNPs as a way to reduce assay costs [10]. On the other hand, 
the results, for LPL in particular, suggest that there may be limits on our ability 
to predict missing sites for reasonable sizes of data sets. Although the ability 
to predict hidden sites is a more stringent test than might be necessary for 
our true goal of association study design, it does provide an estimate of how 
well any given subset of sites captures the information content of the full set. 
For APOE, we can characterize about 80% of the population with only 25% 
of the SNPs and about 90% of the population with about 50% of the SNPs, 
although we require almost all SNPs to get close to 100% of the population. 
For LPL, the picture is more pessimistic, with hardly any resolving power until 
about half of the SNPs are used and with almost all SNPs required to reliably 
distinguish over 90% of sequences. The inability of the method to capture the 
last 10% of population diversity when only a few SNPs are hidden largely reflects 
the fact that the training data is not adequate for characterizing a significant 
fraction of sequence variants that are found only in the testing data. For any 
initial set of polymorphic sites sampled and any population sample size, we can 
expect similar absolute limits on the value of informative SNP selection methods 
determined by how well the sampled data characterizes the variability in the full 
population. As the differences between our results on the two data sets illustrate, 
the magnitude of the problem of inadequate sampling can vary considerably 
depending on local properties of particular genes of interest. These conclusions 
seem unlikely to be dependent on the specifics of our methods, but rather are 
likely to apply to all methods for these problems. They should therefore be 
considered in future assessments of methods for informative SNP selection, by 
performing cross-validation in evaluating methods and by explicitly building 
models of statistical significance of chosen SNP sets into the methods themselves. 

There are many avenues by which this work can be continued. The basic 
models for ancestry detection are in some ways highly simplified. For example, 
they do not adjust for variations in recombination or mutation propensity across 
the genome or explicitly account for many specific genetic processes, such as 
hyper-variable sites, recombination hotspots, or gene conversion. More detailed 
models that incorporate a greater range of existing knowledge about the origins 
of genetic variation may perform better in practice. On the other hand, the mod- 
els are too dependent on unknown parameters — particularly the user-supplied 
mutation and recombination rates — and are likely to be more practically use- 
ful if these parameter can be eliminated or automatically inferred. The actual 
methods for the problems could also likely be significantly improved, particularly 
the crude heuristic method used in this work for informative SNP selection. We 
might also consider extensions of these methods for unphased data or pooled 
data sets from multiple individuals. The measures used to test accuracy — per- 
site and per-sequence — are both imperfect and could use refinement. More 
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generally, the practical importance of these problems suggests a need for estab- 
lished benchmarks of performance under various conditions, such as levels of 
genetic diversity and population sample size. 
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Abstract. Data on the human genome are showing conserved haplotype 
blocks. In order to assess whether these are the footprints of selection or 
reflect hotspots of recombination, it is first necessary to understand the 
effects of population history and demography on the survival of ances- 
tral chromosome segments in the absence of these phenomena. In this 
paper, we present results on the mean and variance of lengths of an- 
cestral segments and shared segments in chromosomes from a current 
population. In a small population, patterns of growth and population 
subdivision may have a substantial impact on observed patterns of chro- 
mosome segments, and the length distributions have high variance and 
are heavy-tailed. 



1 Introduction 

The increasing availability of data on haplotypic variation in human populations, 
has provided information on patterns of allelic association [1] and evidence for 
the existence of conserved blocks of ancestral haplotype [7]. While the block 
structure of parts of the human genome may be the result of recombination 
hotspots [12], haplotypic variation also carries the imprint of geographic [16] 
or selective history [9,10]. However, in order to interpret patterns of haplotypic 
variation, it is first necessary to know the range of patterns expected in the 
absence of such factors. 

Genomic data are also becoming increasingly available on other species, in- 
cluding the relatives of endangered species, making it possible to study the 
genome-wide variation in small remnant populations [13,14]. Previously stud- 
ies of the maintenance of genetic variation in endangered species has focused on 
single genetic loci [8] . New data, models and methods make it possible to study 
similar questions at the chromosome and genome level. 

Thus it is timely to examine the probability distributions of lengths and num- 
bers of ancestral chromosome segments in small populations, and the impact of 
population growth and population subdivision on these probability distributions. 
Since only extant populations can be studied, inference of shared ancestry and 
conservation of ancestral genome is made through comparisons of genome shar- 
ing among chromosomes of a current population. Thus it is also necessary to 
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consider the probability distributions of tracts of genome shared identical by 
descent (ibd) between chromosomes sampled from a current population. 

This paper is based on results in the thesis of Chapman [2] . Details of deriva- 
tions and additional results may also be found in Chapman and Thompson [3,4] . 

2 Chromosome Segments and the Theory of Junctions 

While simulation under the coalescent with recombination [11] can provide re- 
alizations of the ancestral structure of haplotypes under a wide variety of de- 
mographic and genetic scenarios, we have preferred to take a more traditional 
approach deriving from Fisher’s Theory of Junctions [6]. This approach provides 
results under models of random genetic drift: the predictions can be compared 
to data observations. 

Fisher [6] defined a junction in a chromosome as the boundary point between 
segments of distinct founder origin. Junctions are formed by crossovers between 
two chromosomes that differ in founder origin at the crossover location. Once 
formed, a junction segregates, becomes extinct, or becomes fixed, with prob- 
abilities determined by the laws of Mendelian segregation. Whereas Fisher [6] 
considered regular mating systems, we [3] used the theory of junctions to inves- 
tigate the effects of population growth and subdivision on the expected number 
of junctions per Morgan on a chromosome, and hence the lengths of ancestral 
chromosome segments. Some results of this work are summarized in section 3. 

In analyzing the genomic variation in a population, only extant chromosomes 
can be compared. The ancestral chromosomes are normally unknown. Thus it is 
important to consider ihd between extant chromosomes, relative to some ances- 
tral or founder population. In comparing two chromosomes, Fisher [6] defined a 
junction is external if it bounds a region of gene identity by descent {ihd). Oth- 
erwise, it is internal. Following [2] we refer to a length of chromosome ihd in a 
pairwise comparison as a tract. A tract is composed of one or more inter-junction 
segments. 

Stam [15] extended Fisher’s Theory of Junctions to a random mating popu- 
lation of constant size, deriving a formula for the expected number of external 
junctions per Morgan, as a function of time and population size. Under a Markov 
assumption for the process of ihd along a chromosome, this expectation then de- 
termines the distribution of lengths of ihd tracts. Chapman and Thompson [4] 
had two main goals: first, to extend to the case of subdivided populations Stam’s 
derivation of the expected number of external junctions per Morgan in a random- 
mating population, and second, to construct a more realistic model for the ihd 
process along a chromosome, and hence for the distribution of lengths of ihd 
tracts. Some results of this work are presented in sections 4 and 5. 

3 The Number of Ancestral Chromosome Segments 

Crossover events in offspring gametes occur (by definition) at rate 1 per Morgan, 
and a crossover results in a junction unless the homologous chromosomes are ihd 
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at the crossover point. A junction originating in a generation at which there are 
2N chromosomes in the population has chance 1/2N of being present in any 
given chromosome at any later time. Thus the expected number of junctions in 
a chromosome is easily determined, requiring only the pointwise probabilities 
of gene ibd under the specified demographic model. The number of ancestral 
segments is one greater than the number of junctions J, and (1 + E(J))“^ is a 
lower bound on the expected length of an ancestral chromosome segment. 

If junctions formed, segregated, and survived, independently of each other, 
then the number present in any given chromosome would have a Poisson distri- 
bution with the variance equal to the mean. In fact there are several sources of 
dependence which serve to inflate the variance of junction number J. First, vari- 
ation among individuals in the level of ibd causes dependence in the formation of 
junctions at a given generation. Meioses from individuals with lower ibd give rise 
to more junctions. Second, dependence in the temporal variation on ibd gives rise 
to dependence in the number of junctions formed in different generations. At a 
given time, a low number of junctions formed suggests a population with higher 
ibd, hence higher ibd in subsequent generations, and low junction formation at 
these future times. Finally, the presence of a junction in a given chromosome at 
a given time is not independent of the presence of other junctions. For exam- 
ple, junctions formed close to one another in the same meiosis will tend to be 
inherited together. 




Fig. 1. Variance of the number of junctions: see text for details 



Whereas the third source of variance seems impossible to deal with analyti- 
cally, the first two may be addressed. By considering explicitly the covariances 
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due to variation in ihd in a population among individuals and over time, Chap- 
man [2] obtained expressions for the variance of the number of junctions in a 
sampled chromosome. This variance approximation depends on the chromosome 
length and on the population sizes over time, through the single-locus and two- 
locus probabilities of non-ibd of Weir et al. [17] . Figure 1 shows one example of the 
effects of dependence on the variance of the number of junctions, in a randomly 
selected chromosome from a population of constant size N = 20, estimated by 
(i) simulation, (ii) ihd computation, (iii) Poisson approximation. 

The formulae developed by [3] may be applied to any growth scenario for a 
population, and also to subdivided populations. Table 1 shows the means and 
standard deviations of the number of junctions in a chromosome of length 1 
Morgan in subdivided exponentially growing populations. The populations all 
have the same total growth, by a factor of 100 over 100 generations (4.71% 
per generation). The subdivided populations bifurcate when they reach 8, 4, 
and 2 times the initial size, respectively. The standard deviations given here are 
those from the ihd formulae developed by [2]; for the Poisson approximation the 
variance is equal to the mean. Additional results are given in [3]. 

Table 1. The number of junctions per Morgan in subdivided populations 



mean (SD) Initial Population Size 

Subdivision No = 20 Nq = 100 
None 64.9 (9.0) 91.7 (9.6) 

Split at 8No 62.8 (8.8) 90.9 (9.6) 

Split at 4Ao 57.3 (8.4) 88.9 (9.5) 

Split at 2Nq 45.0 (7.8) 83.5 (9.2) 



For small populations, growing from 20 to 2000 over 100 generations, the 
effect of subdivision is marked. An undivided population has 50% more junctions 
than the most subdivided population, and thus lengths of ancestral chromosome 
segment will be 50% longer in the more subdivided population. For an initial 
size population 100, the effects are less marked, but the same patterns are seen. 

4 The Expected Lengths of ibd Tracts 

External junctions are those that bound a tract of ibd between two chromosomes. 
The derivation by Stam [15] of the expected number of external junctions per 
Morgan, E(X), in a random mating population is thus key to considering the 
number and lengths of ibd tracts in a comparison of two extant chromosomes. In 
such a population, the pointwise probability of ibd, tt/, is also easily computed 
as a function of time and population size. If ibd is assumed to be Markov along 
a chromosome, ttj and E(A) together provide the complete distribution of the 
lengths of the alternating ibd and non-ibd tracts. Let the expected lengths be 
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A/ and Xn respectively. Because of the Markov assumption, each of the ibd 
and non-ibd tracts has an exponential distribution with these means. Further, 

= A//(A/ + Xn)- For every recurrence to the ibd state, there are two external 
junctions. Hence E(X)/2 = (A/ + Xn)~^, or A/ = 27T//E(X). 

The assumption that ibd is Markov along a chromosome does not provide a 
good model for the distribution of ibd tract length: a better model is considered 
in section 5. However, for the mean lengths, the formula A/ = 27T//E(X) is 
dependent only on very broad assumptions, and will still hold for our new model. 
Thus, first, we use this formula to consider the effect of population subdivision 
on mean ibd tract lengths. Specifically, [4] show how Stam’s derivation of the 
expected number of external junctions in a pairwise comparison of chromosomes 
may be extended to chromosomes sampled from the same or different randomly- 
mating subdivisions of a total population. 
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Fig. 2. The expected lengths of ibd tracts in subdivided populations 



The same growth and split scenarios are considered as in the previous sec- 
tion. Populations grow exponentially from Nq = 20 or Nq = 100 individuals 
by a factor of 100 over 100 generations. A (sub) population bifurcates into two 
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equal parts when it reaches size 8 Nq, 4Nq or 2Nq. Thus bifurcations occur ev- 
ery 15 generations, but the less subdivided populations do not have their first 
bifurcation until they reach the requisite size at generation 30 or generation 45. 
Figure 2 shows the resulting expected ibd tract lenths for pairs of chromosomes 
sampled in the final population (generation 100). The points at a given popula- 
tion split time, show the expected lengths between a pair of final chromosomes, 
one sampled from each of a pair of subdivisions whose most recent common an- 
cestral population was at this time point. For example, for a population starting 
at Nq = 20 and with every subpopulation bifurcating each time the size reaches 
40, chromosomes from each of a pair of subdivisions whose most recent common 
ancestor was at generation 60 have a mean ihd tract length of just over 3 cen- 
tiMorgans. Under the same scenario (the top curve), if the chromosomes derive 
from different daughter-populations of the earliest split, the expected length is 
only half this amount. With no subdivision, the expected lengths are 1.29 cM 
(No = 20) or 0.73 cM (Nq = 100). Additional results are given by [4]. 

Figure 2 shows there is substantial variation in expected ibd tract length 
among chromosome pairs sampled from subpopulations with different degrees 
of coancestry, particularly when the total population is small and the degree of 
subdivision is high. In reality, natural populations are composed of many (rec- 
ognized or unrecognized) more or less complete subdivisions, and this variation 
in mean ibd tract length is a substantial component of the overall variation in 
tract length. Moreover, each distribution itself has a high variance, as will be 
considered in the next section. 



5 The Distribution of Lengths of ibd Tracts 

The assumption of [15] that ibd is Markov along a chromosome underestimates 
the variance of ibd tract lengths. A better model for the process of junctions 
and ibd along a chromosome may be developed as follows. Junctions in a pair 
of current chromosomes may be classified into one of four types. They may be 
shared (S) and hence internal to an ibd tract; they may be unshared and internal 
to a non-ibd tract (U); they may be external and either initiate (V) or terminate 
(T) an ibd tract. The upper part of Figure 3 shows first a pair of chromosomes: 
different shadings indicate different founder origins. Next the figure shows a 
classification of the junction types in these two chromosomes and an indication 
of the two ibd tracts (shaded black) between them. 

Each ibd {non-ibd) tract starts with a type V (T) junction, is followed by 
some number (possibly 0) of S (U) junctions, and ends with type T (V) junction. 
Rather than assuming a Markov process for ibd, we assume that the junction 
types form a Markov chain. The possible transitions and their probabilities are 
shown in Figure 4. Junctions of type S (and hence also U) tend to cluster in 
the genome: since the junction is shared, other junctions that existed in the 
immediate vicinity on that original chromosome are also expected to be shared. 
This implies that 62 < P 2 , a result also confirmed by simulation [4]. 
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Fig. 3. Pairwise comparison of chromosomes: for details, see text. 




Fig. 4. The Markov chain model of junction types 



Assuming the inter-junction segment lengths are independent of each other, 
the ibd process along the chromosome is an alternating renewal process [4] , and 
Stam’s formula. A/ = 27r//E(A'), for the expected length of an ibd tract still 
holds. However, the new model is far more flexible. The segments between dif- 
ferent pairs of junction types can have different arbitrary distributions. Simula- 
tion provides the distribution of segment lengths shown in Table 2. In fact, all 
the non- ibd segment types have very similar length distributions, and so also do 
the three segment types involving S junctions. Shared junctions are, on aver- 
age, older and more frequent in the population: older segments are longer, and 
frequent segments less likely to be broken up by subsequent recombination. Al- 
though the V-T ibd segments are shorter than the other ibd segment types, as 
a first approximation we assume ibd segment lengths are independent, and all 
have mean length fij and variance aj. 

From the Markov model for the sequence of junction types, the number of 
segments in an ibd tract is AT -I- 1 where Pr(AT = 0) = P 2 , and given AT > 0 it has 
a geometric distribution with parameter (1 — 62 ) (Figure 3). Hence 
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Table 2. Relative lengths of different segment types 



Segment type 


ibd or not 


length description 


T-V 


non-ibd 


shortest 


T-U, U-U, U-V 


non-ibd 


short 


V-T 


ibd 


long 


V-S, S-T 


ibd 


longer 


S-S 


ibd 


longest 



Under our current model, the variance of the length of an ihd tract is 

ajim) + 1 ) + - im)r), 

while under the model of [15] the variance is the square of the mean: 

njim) + !)"• 

The difference between these two expressions gives the increase in variance for 
out model relative to than of Stam [15]. This difference is 

Both parts of this expression are positive. Segment length has a heavy-tailed 
distribution relative to an exponential, and thus aj > /ij. Shared junctions are 
clustered, and thus p 2 > 62 - In fact, our new formula is likely also an under- 
estimate of the variance, although it can provide a substantially greater value 
than Stam’s. For example, we have not taken into account the length variation 
of different ibd segment types. 



6 Observability, Detectability, and Identifiability of 
Junctions from Data 

Given a sufficiently polymorpic founder population and sufficiently dense mark- 
ers, ibd is observable. Hence, ibd tracts and external junctions are, in principle, 
detectable. Internal junctions, whether within ibd tracts or within non-ibd tracts 
are not detectable from a comparison of the two extant chromosomes. 

The lower part of Figure 3 shows what happens when a third chromosome 
is added to the comparison. In this example, the third chromosome shares an 
indicated segment ibd with each of the other two. As additional chromosomes 
are sampled from a population, there are additional junctions present. Either 
new or current junctions may become external in some comparison, and hence 
detectable. In order for a junction to be detectable a copy of at least one of the 
two ancestral chromosomes which recombined to form the junction must exist in 
the sample, but this is insufficient to identify in which chromosome the junction 
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occurs. To identify which extant chromosome carries the junction and which is 
ancestral, both ancestral chromosomes must be present in the sample. Figure 3 
shows an example of one such fully identifiable junction. Although with three 
chromosomes many junctions are not detectable and few are identifiable, the 
number of pairwise comparisons increases quadratically with sample size, and 
in small populations a high proportion become detectable as more chromosomes 
are sampled [2]. 

7 Discussion 

Even in a single random-mating population, patterns of coancestry lead to de- 
pendence in junction origins and histories, and an extant population contains 
junction of varying ages relative to a founder population. Thus the variance of 
junction number is substantially inflated over that provided by a Poisson distri- 
bution, and the lengths of genome segments between junctions thus also have 
higher than anticipated variance. Population growth histories and subdivision 
affect not only mean lengths of segments, but also the variance. 

In comparing extant chromosomes, the length of ihd tracts has high variance, 
due both to the heavy-tailed distribution of segment lengths and also to the 
clustering of shared junctions. In a partially subdivided population, these effects 
are enhanced. Natural populations contain chromosomes which result from a 
mixture of a wide variety of histories. It has been suggested [5] that “genomic 
control” provides a method of interpreting observed patterns of haplotypic vari- 
ation and allelic association. That is, within any population a comparison is 
made among different regions of the genome, since all regions have been subject 
to the same population demography. In making such comparisons it is impor- 
tant to recognize the huge variation in actual histories of chromosome segments 
encompassed within even a simple demographic model, and the consequent vari- 
ability expected among genome regions, even in the absence of selection and of 
recombination hot-spots. 
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Abstract. A new method is presented for use in simulating samples of 
disease and normal chromosomes bearing multiple linked genetic mark- 
ers under a neutral model of mutation, genetic drift, and recombination. 
The method accounts for the potential effects of investigator sampling 
bias by allowing for ascertainment of chromosomes according to disease 
status and of markers according to a pre-specified polymorphism cutoff 
level. The method was implemented in a computer program and applied 
to study the general effects of disease mutation age (or frequency), lev- 
els of marker polymorphism, and sample size, on pairwise LD between 
markers and a disease mutation. It is shown that the average pairwise 
LD between a marker and a disease mutation is lower for older, or more 
prevalent, disease mutations, as expected. The marker polymorphism 
cutoff level also has an important influence on LD. Potential applications 
of the method for predicting the power of genome-wide marker-disease 
association studies are discussed. 



1 Introduction 

With the advent of a human haplotype map initiative [1] and the emerging 
prospect of large amounts of data on haplotype frequencies and linkage dise- 
quilibrium (LD) in the human genome [2, 3,4, 5], as well as future large-scale 
projects aimed at mapping genes influencing complex diseases by marker-disease 
association analysis [6], it has become increasingly important to understand the 
processes determining human genetic variation. Of particular interest is the influ- 
ence of underlying evolutionary forces [7,8] and sample ascertainment strategies 
[9,10] on variation observed at commonly used genetic markers, such as single 
nucleotide polymorphisms (SNPs) and microsatellites. A related question con- 
cerns the potential power of disease-marker association studies that rely on link- 
age disequilibrium (LD) to locate disease susceptibility genes [6,11,12]. Recent 
studies have used computer simulations to address these, and other, questions 
[7,10,13,14]. 

Most simulation studies generate samples of chromosomes under a neutral 
coalescent model, taking account of the population processes (mutation, recom- 
bination, genetic drift, etc) that determine patterns of marker polymorphism and 
linkage disequilibrium [10,15]. Such studies are useful for predicting the patterns 
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of neutral variation one would expect to see in a random sample of chromosomes 
from a population, but the genetic variation will be different in a case-control 
study because individuals are ascertained based on disease status. Krugylak [7] 
attempted to take account of disease allele frequency by only accepting genealo- 
gies with a mutation present at a frequency in the sample that was equal to the 
required population frequency of the disease mutation. There are two problems 
with this design: (1) for small samples the sample frequency of a mutation will 
deviate considerably from the population frequency; (2) disease studies typically 
enrich the sample for a disease allele by ascertaining for affected individuals and 
the frequency of a disease allele in a sample will therefore be much greater than 
its frequency in the population. A case-control sampling strategy enriches the 
sample for chromosomes descended from one or more disease mutations and this 
has the effect of altering the shape of the underlying genealogy causing it to 
differ from that expected under a neutral coalescent model [14,16]. There is also 
an ascertainment effect for markers because SNPs and/or microsatellite markers 
are chosen from panels of markers known to be polymorphic [9,10]. 

To examine many of the above questions (e.g., the power of case-control 
association studies for mapping disease mutations, etc) via simulation studies, a 
new simulation method is needed that allows samples to be generated under a 
coalescent process with a case-control sampling strategy and polymorphic marker 
ascertainment. Here we outline a method to simulate samples under a coalescent 
process that allows for these sources of ascertainment bias. Our basic strategy is 
to simulate the sample path (over time) of the population frequency of a disease 
allele, using a diffusion approximation, and then to simulate the coalescent of a 
sample of chromosomes conditional on the sample path of the allele frequency 
(using theory for the coalescent process in a population of variable size). This 
improves upon an earlier method for simulating a coalescent process with disease 
ascertainment proposed by Zollner and von Haeseler [14], eliminating several key 
assumptions in their model which will often be violated in practice. In particular, 
they assumed that the frequencies of disease mutations remain constant over 
time; our diffusion simulation eliminates the need for this assumption. 



2 Theory 

The ancestral processes of lineage coalescence and recombination (within a pan- 
mictic population) traced backwards in time can be represented as a graph, with 
recombination events splitting a chromosome to create two ancestors (each carry- 
ing only a segment of the descendent chromosome) , and coalescence events unit- 
ing pairs of chromosomes to generate a common ancestral chromosome [17,18]. 
As is usual, this will be referred to as the “ancestral recombination graph.” 
One can simulate population samples of chromosomes and genetic markers by 
simulating the ancestral recombination graph and then simulating independent 
mutations on the lineages of this graph according to a Poisson process. The stan- 
dard model of a coalescent process with recombination and mutation assumes 
that chromosomes are a random sample from a population (i.e., each is equally 
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likely to be sampled) and that the markers are a random sample of loci that 
may have any number of alleles segregating in the population, including a single 
allele (no polymorphism). 



2.1 Coalescent Process with Disease and Marker Ascertainment 

Here we propose a procedure for simulating genetic markers on chromosomes 
under a coalescent and recombination process taking account of the effects of 
both disease and marker ascertainment. Our simulation method is similar to 
that of [14] but uses fewer approximations and relaxes some assumptions. It is 
assumed that no heterogeneity exists at the disease locus (e.g., all chromosomes 
bearing the disease mutation descend from a single ancestral chromosome on 
which the disease mutation arose), although this assumption can be relaxed by 
simulating multiple disease allele genealogies. Let tiq be the number of disease 
chromosomes sampled and let be the number of disease chromosomes an- 
cestral to the sample at generation t in the past. Let tIq being the number of 
normal chromosomes sampled and let be the number of normal chromosomes 
ancestral to the sample at generation t in the past. 

Let Nq be the present population size, let Nt be the population size at gen- 
eration t in the past, and let pt be the frequency of the disease mutation at 
generation t in the past. We define L to be the total number of marker loci 
examined and simulate either SNP markers with mutation rate pnc = 10“® or 
microsatellite markers under a stepwise mutation model [19,20] with mutation 
rate pmt = 10“^. Other models of mutation could be easily incorporated. The 
map distance between marker 1 and marker L is defined as p, the distance be- 
tween markers i and j is defined as pij , and the location of the disease mutation, 
denoted as 9, is defined as the distance of the disease locus (in map units) from 
marker 1. 



2.2 Ancestral Graph with Disease Ascertainment 

The coalescent and recombination processes are simulated jointly. The process 
is illustrated in figure 1. Note that some time after the generation at which the 
disease mutation arose, the disease chromosomes share a most recent common 
ancestor (MRCA). After the time of origin of the disease mutation, disease chro- 
mosomes only coalesce with one another as do the normal chromosomes. Recom- 
binations, on the other hand, can occur both within, and between, genealogies of 
the disease and normal chromosomes. If a disease chromosome recombines with 
a normal chromosome, this increases the rate of the coalescence-recombination 
process in the genealogy of normal chromosomes by one and vice versa. If a 
recombination occurs between two disease chromosomes, or two normal chromo- 
somes, the effect is to increase the rate by one in the respective class in which the 
recombination event occurred (Figure 1). To obtain the waiting times between 
events, as well as the type of each event, we simulate three waiting times: (1) the 
time until a coalescent event occurs in the sample of disease chromosomes; (2) 
the time until a coalescence event occurs in the sample of normal chromosomes; 
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Fig. 1. An example genealogy of cases (inheriting a disease mutation) and con- 
trols. The age of the most recent common ancestor (MRCA) of the cases is 
denoted as T and the age of the disease mutation as T’. Marker mutations are 
indicated by Xs and locations of recombination events by vertical lines on a hor- 
izontal line representing the segment of chromosome spanned by markers. The 
states of 3 biallelic markers are indicated by Os (ancestral) and Is on chromo- 
somes at the tips of the genealogy. 



and (3) the time until a recombination event occurs in the ancestry of either the 
normal chromosomes, or the disease chromosomes. To facilitate the simulation of 
the disease allele frequency over time, we use a discrete-time model, rather than 
a continuous time model as is usual for the coalescent process. The probability 
density function (pdf) of the time until a coalescence event occurs in the gene 
tree of the disease chromosomes, given that the previous event (coalescence or 
recombination) occurred at time tg, is 



foito) 




( 1 ) 



In our simulations, we assumed that the population has grown at an exponential 
rate r [21,22]. In that case equation 1 becomes 
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The pdf of the time until a coalescence event occurs in the gene tree of the 
normal chromosomes, given that the last event (coalescence or recombination) 
occurred at time to, is 
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The pdf of the time until a recombination event occurs is 
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A method is described below for generating random variables from each of the 
density functions of equations 2, 3 and 4 above using either the usual inverse 
transformation method or a recursive equation. 



2.3 Method for Simulating Events in the Ancestral Graph 

To simulate the waiting time, until a coalescence occurs in the sample of 
normal chromosomes, for example, using the inverse transformation method, we 
would first simulate a uniform random variable U € [0, 1] and then solve for In* 
in the equation 

U = 1 
= 1 

where Fn^In) is the cumulative density function of In. Waiting times until 
recombinations are simulated from the exponential density of equation 4 using 
the inverse transformation method. If < Im and < tn, the next event is the 
coalescence of a random pair of chromosomes in the genealogy of the sampled 
disease chromosomes at time tn- tN < to and In < Ir, the next event is the 
coalescence of a random pair of chromosomes in the genealogy of the sampled 
normal chromosomes at time tv- Otherwise, the next event is a recombination 
involving a randomly chosen chromosome at time t/j. 

In order to speed up simulation of the time until coalescence, instead of faith- 
fully following the CDF and using inverse transformation, we make use of the 
probability that a coalescence occurs at time — 1 to calculate the probability of 
a coalescence at time t* . To simulate the genealogy of the disease chromosomes, 
for example, this is done by multiplying a value. 
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where C is equal to {n^D — 1)/ (4Aq) . 

Using the above recursion to calculate the probability for each generation, the 
simulation procedure can be expressed as follows: 
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1. Generate a random number U, between 0 and 1. 

2. Let j = to, Q= F = Q. 

3. If U < F, return t* = j. Otherwise, go to step 4. 

4. According to equation 6, Q = Q x exp|— C ^ ~ 

F + Q, j = j + 1. 

5. go to Step 3. 

Once the disease chromosomes coalesce to a MRCA, the rate of coalescence 
in the genealogy of disease chromosomes is zero until either the time is reached 
at which the disease mutation arose, or a recombination occurs. If the sample 
of normal chromosomes coalesce to a MRCA before the disease mutation arose 
the rate of coalescence in the genealogy of normal chromosomes is also zero until 
either the time is reached at which the disease mutation arose, or a recombination 
occurs. At times, in the past, greater than the age of the disease mutation the 
rate of the coalescence process is scaled by nf + 1 until the common ancestor 
of the disease chromosomes coalesces with a normal chromosome; the rate is 
then scaled by . If a recombination event occurs during the simulations the 
position of the recombination event is chosen uniformly on the interval of length 
p spanning the markers; this assumes that if recombination hotspots are present 
they are accounted for by the map lengths of the intervals (other more complex 
models of recombination with explicit hotspots could also be used) . 



2.4 Diffusion Model of Disease Allele Frequency 

In our simulations, we fix the age of the disease mutation, and then simulate the 
change of frequency of the mutation over time, conditional on non-extinction. 
To simulate the change of frequency of the disease mutation we assume that the 
process can be modeled using a diffusion approximation [24]. We simulate the 
diffusion process using a procedure suggested by Kimura and Takahata [25] . The 
basic idea is that the allele frequency at the next generation, given the current 
frequency, is normally distributed with expectation and variance determined 
by the diffusion model. If population size is constant, N, then E[pt_|_i] = pt and 
o'pt+i = Pt(l~Pt)/2A. If the population size is growing exponentially with rate r 
and current population size Nq then E[pt+i] = pt and = pt(l— P()/(2Aoe'’*). 

If an allele has fewer than 4 copies in the population, then the number of alleles 
in the next generation is instead simulated as a Poisson random variable, using 
the branching process approximation for a rare allele [24], with parameter xe'" . 
This is because the diffusion approximation is no longer accurate in this case. In 
our simulation method, the initial copy number of the disease mutation when it 
arises is a; = 1. 



2.5 Models of Mutation and Marker Ascertainment 

Mutations are simulated on the ancestral recombination graph generated ac- 
cording to the procedure outlined above. Conditional on the graph (genealogy). 
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the distribution of the positions of mutations on branches is uniform under a 
Poisson process model of mutation. However, in a coalescent simulation with 
recombination, the only relevant mutations are those that occurred on portions 
of ancestral chromosomes that were transmitted to one or more chromosomes 
in the sample (recall that with recombination only certain segments of recombi- 
nant chromosomes are ancestral to the sample). To improve the efficiency of our 
simulation procedure, we therefore used a simple algorithm to label markers on 
those chromosomal segments (branches of the ancestral recombination graph) 
that leave descendents in the sample. 

The basic algorithm proceeds as follows: (1) all markers on all sampled chro- 
mosomes are initially labeled with integer 1; (2) the network is traversed from 
the tips to the final MRCA; (3) at each recombination event markers peripheral 
to the recombination point are labeled with integer 0 and if a marker already is 
labeled zero it retains its label regardless of its position relative to the current 
recombination point; (4) at each coalescence event if markers at a locus on the 
two coalescing chromosomes are 0/1, 1/0 or 1/1 then the locus is labeled 1 on 
the branch preceding the coalescence event and if they are 0/0 it is labeled 0. 
The algorithm proceeds until all loci on all branches have a binary integer index. 

Mutations are simulated at each locus on all branches labeled 1 according to a 
Poisson process (with rate /x x t on a branch of length t) . This mimics a sampling 
process in which the investigator continues typing SNPs or microsatellite markers 
for a particular sample of chromosomes until a total of L polymorphic markers 
are obtained. A Jukes-Cantor model [23] is used to simulate SNP mutations 
(this model assumes that nucleotides A, T, G and C occur in equal frequencies), 
although more complex models with transition/transversion bias, unequal base 
frequencies, etc could be easily incorporated. In practice, with the low rate of 
nuclear mutation only one substitution will occur on a genealogy in most cases, so 
the substitution model is rather unimportant. A stepwise mutation model is used 
for modeling microsatellite mutation [19] such that each mutation (with equal 
probability) either increases, or decreases, the numbers of repeats by one unit 
(e.g., changes the length by four in a tetranucleotide repeat, etc). The number 
of (potentially polymorphic) microsatellite markers is specified by the user as a 
function of the size of the interval. The simulation results that we present here 
are for SNP markers only. Note that under our simulation procedure both the 
number (and positions) of polymorphic SNPs for a given simulation are random 
variables. 



3 Simulation Results 

In this section, we describe a simulation study conducted to examine the prop- 
erties of our method as well as the general features of the distribution of LD 
in ascertained samples of disease and normal chromosomes. The simulations 
examine the effects of disease mutation age and frequency, levels of marker poly- 
morphism, and sample size, on average levels of pairwise LD between markers 
and a disease mutation. 
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3.1 Pairwise Patterns of LD in a Growing Population 

We assumed a population size of 10® with an exponential growth rate of 0.01. 
SNPs are distributed along a region of 1 cM (1000 kb). The disease mutation 
is to the left of all markers. The sample size is set to be 50 disease and 50 
normal chromosomes in both figures 2 and 3. Figure 2 shows the influence of 
disease mutation age on LD. Disease mutation ages were chosen so that expected 
population frequencies of 0.0001, 0.001, 0.03, 0.1, 0.2 would be obtained. These 
correspond to disease mutation ages of 164, 600, 710, 840 and 910 generations. 
The polymorphism cutoff level was 0.05. In figure 3, the disease age was set to 
be 840 generations and the polymorphism cutoff levels were 0.05, 0.1, 0.2, and 
0.3. In figure 4, the disease ages and polymorphism cutoff level are the same as 
those used in figures 2 and 3. The sample sizes are 10/10, 20/20, 50/50, 100/100, 
250/250 disease/normal chromosomes. Linkage disequilibrium is measured using 
D and [26]. In total, 100 replicate simulations were carried out for each set 
of conditions and the average values of pairwise LD are plotted. We applied the 
simulation method to investigate the effects on pairwise LD of disease mutation 
age (frequency), marker allele polymorphism cutoff level, and sample size. The 
results suggest that the average LD between a disease allele and markers is 
lower for older (more prevalent) disease mutations and higher for younger (less 
prevalent) mutations (Figure 2). The marker polymorphism level also has an 
important effect on LD. Figure 3 suggests that selecting marker loci with higher 
polymorphism levels increases the average LD and could potentially increase the 
power of an LD mapping study. This effect decreases with an increasing the map 
distance of the marker from the disease mutation. From figure 4, we see that the 
LD may be biased with small sample sizes. For the simulation conditions used 
in our study, there is a positive bias in LD for markers near the disease locus 
(at a distance of less than about .004 Morgans) and a slight negative bias for 
markers beyond this distance. If more chromosomes are sampled (more than 
100), the average LD changes little. More extensive simulations are needed to 
fully understand the extent and importance of such bias. 

4 Discussion 

There is currently a great deal of interest, and optimism, concerning the prospect 
of using population level linkage disequilibrium to detect markers that are closely 
linked to a disease locus. Whether such genome- wide association studies will have 
the power to detect common genetic variants associated with complex diseases 
remains uncertain. It appears likely that the potential power of such studies 
minimally depend on factors such as the demographic history, etc of the pop- 
ulation(s) under consideration, population sampling strategies, and population 
frequencies of underlying disease loci. The goal of this paper has been to de- 
velop a more realistic simulation algorithm to generate population samples for 
the purpose of studying the influences of such factors. 

There are, of course, many additional factors that we have not considered 
that can be expected to influence the feasibility of studies aimed at identifying 
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Fig. 2. Simulation results showing average pairwise linkage disequilibrium (LD) 
between markers and a disease mutation as a function of map distance and 
disease mutation age. LD is measured using either \D\ (panel A) or (panel B). 
Younger mutations tend to show greater levels of LD with adjacent markers and 
average LD decreases monotonically with map distance. Because the locations 
of polymorphic markers are random under our simulation method map distances 
are given as intervals. 



disease loci using population-level marker-disease associations. For example, the 
relationship between genotype and phenotype can be very complex and factors 
such as the degree of penetrance, phenocopy rate, etc can be expected to greatly 
influence the power. Fortunately, it is straightforward to simulate phenotypes 
conditional on genotypes at a disease locus under arbitrarily complex models 
(including multilocus quantitative genetic models) and therefore the simulation 
methodology developed here could be used in a two-step modeling approach 
whereby the genotypes are simulated using our algorithm and the phenotypes are 
simulated conditional on the genotypes under arbitrary models of the phenotype- 
genotype relationship. 
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Fig. 3. Simulation results showing average pairwise linkage disequilibrium (LD) 
between markers and a disease mutation as a function of map distance and the 
cutoff for the minimum level of polymorphism for markers. LD is measured using 
either \D\ (panel A) or (panel B). More polymorphic markers tend to show 
greater LD with the disease mutation. Average LD decreases monotonically with 
map distance. Because the locations of polymorphic markers are random under 
our simulation method map distances are given as intervals. 
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Abstract. Recent studies have revealed that the human genome can be 
decomposed into large blocks with high linkage disequilibrium (LD) and 
relatively limited haplotype diversity, separated by short regions of low 
LD. One of the practical implications of this observation is that only a 
small number of tag SNPs are needed for mapping genes responsible for 
human complex diseases, which can significantly reduce genotyping effort 
without much loss of power. In this paper, we survey the dynamic pro- 
gramming algorithms developed for haplotype block partitioning and tag 
SNP selection, with a focus on algorithmic considerations. Extensions of 
the algorithms for analysis of genotype data from unrelated individuals 
as well as genotype data from general pedigrees are considered. Finally, 
we discuss the implications of haplotype blocks and tag SNPs in associ- 
ation studies to search for complex disease genes. 



1 Introduction 

The pattern of linkage disequilibrium (LD) plays a central role in genome-wide 

association studies of identifying genetic variation responsible for common hu- 
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man diseases (Kruglyak 1999; Nordborg and Tavare, 2002; Risch et al., 1996; 
Weiss and Clark, 2002). Compared to traditional linkage studies, association 
studies based on LD have two major advantages. First, only unrelated individ- 
uals need to be genotyped, which makes it possible to utilize a large number of 
individuals. Second, because LD reflects a large number of historical recombina- 
tion events, rather than just those in a pedigree, it is possible to map genes on 
a fine scale. Single nucleotide polymorphism (SNP) markers are preferred over 
microsatellite markers for association studies because of their high abundance 
along the human genome (SNPs with minor allele frequency greater than 0.1 
occur once about every 600 basepairs) (Wang et al., 1998), the low mutation 
rate, and accessibility to high-throughput genotyping. However, genotyping a 
large number of individuals for every SNP is still too expensive to be practical 
using current technologies. 

The number of SNPs that are required for a genome- wide association study 
depends on the pattern of LD. The more rapid the decay of LD, the more SNPs 
are needed. Previous studies have observed substantial variation in LD patterns 
across the human genome (Dunning et ah, 2000; Eisenbarth et al., 2001; Reich 
et al., 2002; Taillon-Miller et al., 2000). Thus, the number of SNPs that are 
needed for a genome-wide association study has been greatly debated in recent 
years. The estimates of the number of SNPs for an association study have large 
variation using either simulations (Kruglyak 1999) or empirical studies (e.g., 
Reich et al., 2002). Recent studies have shown that the human genome can be 
parsed into discrete blocks of high LD interspersed by shorter regions of low or 
no LD (Daly et al., 2001; Dawson et al., 2002; Gabriel et al., 2002; Johnson et 
al., 2001; Patil et al., 2001). Only a small number of characteristic (’’tag”) SNPs 
are sufficient to capture most of haplotype structure of the human genome in 
each block (Johnson et al., 2001; Patil et al., 2001). Thus the required number 
of SNPs could be greatly reduced without much loss of power for association 
studies (Zhang et al., 2002a). 

Many methods have been proposed to identify haplotype blocks and corre- 
sponding tag SNPs (Gabriel et al., 2002; Patil et al., 2001; Wang et al., 2002; 
Zhang et al., 2002b). It is not obvious which method should be used for haplo- 
type block partitioning and tag SNP selection, however. Available methods can 
be classified into two groups. One is to first identify the boundary of blocks, 
and then select tag SNPs in each resulting block (Daly et al., 2001; Dawson et 
al., 2002; Gabriel et al., 2002; Wang et al., 2002). The other group of methods 
partition the haplotypes into blocks to minimize the total number of tag SNPs 
over a region of interest or the whole genome (Patil et al., 2001; Zhang et al., 
2002b; Zhang et al., 2003). In this paper, we review the dynamic programming 
algorithms developed for haplotype block partitioning and tag SNP selection 
to minimize the total number of tag SNPs based on either haplotype data or 
genotype data. Our approaches fall in the second category of methods. 
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2 Whole-Genome Haplotype Block Partitioning 

In a large-scale study of chromosome 21, Patil et al. (2001) identified 20 haplo- 
types using a rodent-human somatic cell hybrid technique consisting of 24,047 
SNPs (with at least 10% minor allele frequency) spanning over 32.4 Mbps. They 
developed a greedy algorithm to partition the haplotypes into 4,135 haplotype 
blocks with 4,563 tag SNPs based on two criteria: (1) In each block, at least 
80% of the observed haplotypes are represented more than once, and (2) the 
minimum set of SNPs that can distinguish at least 80% of haplotypes is selected 
as tag SNPs. Zhang et al . (2002b) followed the definitions of haplotype block 
and tag SNPs of Patil et al. (2001). For the same data, they reduced the num- 
bers of blocks and tag SNPs to 2,575 and 3,582, respectively, using a dynamic 
programming algorithm. 

Before recalling the dynamic programming algorithm, we adopt the notation 
used in Zhang et al. (2002b) and Zhang et al. (2003). Assume that we are given K 
haplotype samples comprised of n consecutive SNPs: si, S2, • • • , s„. For simplic- 
ity, the SNPs are referred to as 1, 2, • • • , n in the context. Let h\,h 2 , - ■ ■ ,Hk be 
the K haplotype samples. Each haplotype hk,k = 1, 2, • • • , AT, can be represented 
as an n-dimensional vector with the z-th component hk{i) = 0, 1, or 2 being the 
allele of the k-th haplotype at the z-th SNP locus, where 0 indicates missing 
data, and 1 and 2 are the two alleles. To make the present paper self-contained, 
we summarize the definitions of ambiguous and unambiguous haplotypes used 
in Patil et al. (2001) and Zhang et al. (2002b). Consider haplotypes defined by 
SNPs z to j. Two haplotypes, hk and hk', are compatible if the alleles for the two 
haplotypes are the same at the loci with no missing data, that is hk{i) = hk'{i), 
for any l,i < I < j and hk{i)hk>{i) yf 0. A haplotype in a block is ambiguous 
if it is compatible with two other haplotypes that are themselves incompatible. 
For example, consider three haplotypes hi = (1,0, 0,2), h 2 = (1,1, 2,0), and 
ft-3 = (1, 1, 1, 2). Haplotype hi is compatible with haplotypes ft-2 and h^, but /z2 
is not compatible with h^ because they differ at the third locus. Thus, hi is 
an ambiguous haplotype, whereas ft-2 and /13 are unambiguous haplotypes. In 
Patil et al. (2001) and Zhang et al. (2002b), only unambiguous haplotypes were 
included in the analysis and compatible haplotypes were considered as identical 
haplotypes. 

The following definitions of haplotype block and tag SNP were first used by 
Patil et al. (2001), and then generalized by Zhang et al. (2002b, 2003). A set 
of consecutive SNPs can form a block if at least a percent of haplotypes are 
common haplotypes. The tag SNPs were chosen as the minimum set of SNPs 
that can distinguish at least a percent of the haplotypes. Due to the small 
sample size (20 haplotypes) used in Patil et al. (2001), the common haplotypes 
are those represented more than once. In general, the common haplotypes are 
defined as those haplotytpes with frequency at least 7. Given a and 7, the 
following functions were defined (Zhang et al., 2002b; zhang et al., 2003) for a 
set of consecutive SNPs for SNP z to SNP j: 
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• block{i, ■ ■ ■ ,j) = 1 if SNPs from i to j form a block. Otherwise, this function 
is defined as 0. 

• f{i, . . . ,j): the number of tag SNPs within a block formed by SNPs from i 

to j. Given a set of disjoint blocks, B = , B/} and B\ ^ B^ 

. . . < Bi, where B\ B 2 indicates that the last SNP of B\ is located before 
the first SNP of B 2 , (note that the last SNP of B\ and the first SNP of 
B 2 need not be consecutive, and thus the interval between them is excluded 
from this block partition), the total number of tag SNPs for these blocks is 

defined by f{B) = Y, 

i=l 

Using the above criterion for defining blocks and tag SNPs, Zhang et al. 
(2002b) designed a dynamic programming algorithm to partition haplotypes 
into blocks with the minimum total number of tag SNPs. Define S{j) to be the 
number of tag SNPs for the optimal block partition of the first j SNPs, si , . . . , 
and set S'(O) = 0. Then, 



S{j) = min{S'(i - 1) + f{i, if 1 < i < j and block{i, = 1}. 

Using the recursion, a dynamic programming algorithm was developed to 
compute the minimum number of tag SNPs for all of the n SNPs, S{n), includes 
a trace back to find the optimal block partition. 

In practice, there may exist several block partitions that give the minimum 
number of tag SNPs. Zhang et al. (2002b) used another dynamic programming 
algorithm to find the partition with the minimum number of blocks simultane- 
ously. Let C'(j) be the minimum number of blocks of all the block partitions 
requiring S{j) tag SNPs in the first j SNPs. Then, applying dynamic program- 
ming theory again. 



C{j) = min{C'(i — 1) -I- 1, if 1 < i < j and block{i, . . . ,j) = 1 

and S'(j) = S'(^-l)-k/(^,...,j)}■ 

where G(0) = 0. By this recursion, the minimum number of blocks in the 
partition, C„, can be computed. 

It is worth noting that the problem of finding the minimum number of SNPs 
within a block to uniquely distinguish all the haplotypes is known as the MINI- 
MUM TEST SET problem, which has been proven to be NP-Complete (Garey 
and Johnson, 1979). Thus, there are no polynomial time algorithms that are 
guaranteed to find the optimal solution for an arbitrary input. However, the 
number of tag SNPs in a block is generally small, so Zhang et al. (2002b) enu- 
merated all the possible SNP combinations beginning with the smallest size until 
the minimum size was located. The complexity of this algorithm can be easily 
analyzed. The space complexity for this algorithm is 0{K ■ n). Given a block 
of k SNPs: Si, , Si+k-i, the computation time for block{i, . . . ,i + k — 1) is 
0{K^ • k) because we need to determine if any two of the K haplotypes are 
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compatible at these SNPs in the block. In total, there are at most 0{n • N) 
blocks, which requires 0{K‘^ ■ iV^ • n) time for computing all values of block{-). 
N is the number of SNPs contained in the largest block, and iV <C n generally. 
The enumeration method for computing /(z, ■ ■ ■ ,j) costs at most time 0{N^) 
theoretically but runs much faster in practice. Thus, the overall time complexity 
becomes {K'^ ■ ■ n). 

Other than the definitions of haplotype block and tag SNPs used in the 
study of Patil et al. (2001) and a series papers of Zhang et al. (2002a, 2002b, 
2003), many other definitions for hapotype blocks have been used in previous 
studies. Daly et al. (2001) identified regions between which there is no apparent 
recombination event as blocks. Gabriel et al. (2002) looked for areas within 
which the most pairs of SNPs have high LD measures, for example D’, as blocks. 
Wang et al. (2002) determined the segments in which there is no evidence for 
historical recombinations between any pair of SNPs as blocks using the four- 
gamete test. Other methods for identifying tag SNPs for a given block have 
also been proposed. For example, Johnson et al. (2001) proposed to choose tag 
SNPs based on haplotype diversity. We point out that the other definitions for 
haplotype block and tag SNPs can be easily incorporated into the dynamic 
programming algorithm. Zhang et al. (2002b) adapted another method for tag 
SNP selection into their dynamic programming algorithm, in which /(•) was 
defined as the minimum number of SNPs required to explain a percentage of 
the total haplotype diversity in the block. The dynamic programming algorithm 
is still valid (Zhang et al., 2002b). The dynamic programming algorithm also 
provides a general framework to refine the blocks obtained by other methods. For 
instance, Wang et al. (2002) used four-gamete test to identify regions in which 
there are no historical recombinations between each pair of SNPs and define such 
regions as blocks. However, there may exist many block partitions that satisfy 
this criteria and it is hard to choose one without additional information. By 
setting /(•) = 1 for all potential blocks, we can use the dynamic programming 
algorithm to find a block partition with the minimum number of blocks, which is 
equivalent to finding a block partition with minimum number of recombination 
events. 

3 Haplotype Block Partitioning with Limited Resonrces 

In the above studies, the objective was to minimize the total number of tag SNPs 
for the entire chromosome. However, when resources are limited, investigators 
may not be able to genotype all the tag SNPs and instead must restrict the 
number of tag SNPs used in their studies. With a given number of tag SNPs to 
be genotyped, some of SNPs may be excluded from the analysis. The objective 
now is to prioritize SNPs and corresponding chromosomal regions for genotyping 
in association studies with limited resources. To achieve this, Zhang et al. (2003) 
introduced an additional function to represent the length of a set of consecutive 
SNPs, Si, ... , Sj-. 
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• the length of SNP i to j. It can be defined as the number of 

SNPs, L{i, ■ ■ ■ ,j) = j — i + 1. It can also be set as the actual length of 
the genome spanning from the i-th SNP to the j-th SNP. Given a set of 
disjoint blocks, B = {Bi, B 2 , ■ ■ ■ , Bj}, the total length for these blocks is 

Based on the above notation, Zhang et al. (2003) proposed to find the hap- 
lotype block partition to maximize the total length of the region included with 
a fixed number of tag SNPs and gave the following mathematical formulation: 
Block Partition with a Fixed Number of Tag SNPs (FTS): Given 
K haplotypes consisting of n consecutive SNPs, and an integer m, find a set 
of disjoint blocks B = {Bx, B^, ■ ■ ■ , Bi} with f{B) < m such that L{B) is 
maximized. 

This problem was converted to an equivalent, ’’dual” problem as follows: 
Block Partition with a Fixed Genome Coverage (FGC): Given a 
chromosome with length L, K haplotypes consisting of consecutive n SNPs, and 
/3 < 1 , find a set of disjoint blocks B = {i?i, i? 2 , • ■ • , Bi} with L{B) > (3L such 
that f{B) is minimized. 

Zhang et al. (2003) developed a 2-dimensional (2D) dynamic programming 
algorithm for the FTS problem, and then a parametric dynamic programming 
algorithm for the FGC problem. 

3.1 A 2D Dynamic Programming Algorithm 

Let S'(j, k) be the maximum length of the genome that is covered by at most k 
tag SNPs for the optimal block partition of the first j SNPs, j = 1, 2, . . . , n. Set 
S'(0, k) = 0 for any fc > 0 and S'(0, k) = —00 for any fc < 0. Then, 

[S{j-l,k) 

S{j,k) =max^ S{i - l,k - f{i, . . . ,j)) + L{i, . . . ,j) 

I for all 1 < t < J where block{i , . . . , j) = 1. 

Let B = {Bi,B 2 , . . . , Bj} be the set of disjoint blocks for S{j, k) such that 
L{B) is a maximum with the constraint f{B) < k. Then either the last block Bj 
ends before j, such that S{j, k) = S{j — l,k), or Bj ends exactly at j and starts 
at some f*, 1 < i < j , such that S'(j, k) = S{i* — 1, fc — f{Bj)) + L{Bj). Using 
this recursion, Zhang et al. (2003) designed a dynamic programming algorithm 
to compute S{n,m), the maximum length of genome that is covered by m tag 
SNPs. The optimal block partition B can be found by back tracking the elements 
of S that contribute to S{n,m). 

Zhang et al. (2003) also analyzed the complexity of this algorithm for K 
haplotypes consisting of n SNPs. The space complexity for this algorithm is 
0{m ■ n). If the values of block{-) , f {■) , and L{-) have been pre-computed, the 
time complexity of this algorithm is 0{N ■ m ■ n), where N is the number of 
SNPs contained in the largest block, and N n for large n generally. Since 
the computation time for L{- ■ ■) is 0(1), the overall time complexity becomes 
0{K^ • ■ n + N ■ m • n). 
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3.2 A Parametric Dynamic Programming Algorithm 

Zhang et al. (2003) proposed a parametric dynamic programming algorithm to 
solve the FGC problem. For a consecutive set of SNPs z, . . . , j, if block{i, ■ ■ ■ ,j) = 
1 and this block is included in the partition, the score for them is defined as /(•), 
which is the number of tag SNPs in this block. If these SNPs are excluded from 
the partition, the score for this exclusion is defined as XL{i, . . . ,j), where A is 
the parameter for deletion and A > 0. This parameter plays a crucial role in 
the algorithm. A can be regarded as the penalty for each unit length of the 
excluded regions. Using this scoring scheme, they scored a block partition by 
f{B) + XL{E), where B represents the included blocks, and E represents the 
excluded SNPs. Let the scoring function S'(j, A) be the minimum score for the 
optimal block partition of the first j SNPs {j = 1, 2, . . . , n) with respect to the 
deletion parameter A. Let S{j, A) = 0. Zhang et al. (2003) applied the dynamic 
programming algorithm to obtain S{j, A) by the following recursion: 



S{j, A) = min 



S{i - 1, A) + XL{i, . . . , j), 1 < z < j; 

S{i- 1,A) + /(z, . . . , j), 1 <i<j and block{i, . . . ,j) = 1. 



For any j, if there exists i* satisfying 1 < z* < j and S{j, A) = S{i* — 1, A) + 
then the block is included in the partition. Otherwise, 

there must exist i* satisfying 1 < z* < j and S{j, A) = ^(z* — 1, A) + AL(z*, . . . , j), 
such that the interval [z*, . . . ,j] is excluded from the partition. 

Obviously, S{n, 0) = 0 since all SNPs are excluded from the block partition, 
and S{n, oo) equals the minimum number of tag SNPs for the entire genome 
because all SNPs are included in the block partition. S{n,oo) can be obtained 
by the dynamic programming algorithm developed by Zhang et al. 2002b. For 
any fixed A > 0, the parametric dynamic programming algorithm can compute 
the optimal solution with included blocks and excluded intervals. Zhang et al. 
(2003) showed that if the length of the included blocks equals to (3L, then the 
number of tag SNPs is S{N, A) — A(1 — f3)L which must be equal to the minimum 
number of tag SNPs that is necessary to include at least (3L of the genome length. 
Thus, FGC problem can be solved by this parametric dynamic programming 
algorithm. 

The parametric dynamic programming algorithm is a natural extension of 
the classical dynamic programming algorithm which has served as a classical 
computational tool in sequence alignment, where the parameters are the weight 
of matches, mismatches, insertions/deletions and gaps (Gusfield et al., 1994; Wa- 
terman et al., 1992). Thus, the methods developed in Gusfield et al. (1994) and 
Waterman et al. (1992) can be used to study the properties of block partitions 
according to the deletion parameter A. According to Waterman et al. (1992), it 
can be shown that S{n, A) has the following properties: 

S{n, A) is an increasing, piece wise-linear, and convex function of A. The right- 
most linear segment of S{n, A) is constant. The intercept and slope for S{n, A) 
for each piecewise-linear segment are the total number of tag SNPs and the total 
length of excluded intervals, respectively. 
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Waterman et al. (1992) proposed a method to find S{n,\) for all A > 0 
efficiently, which was outlined in Zhang et al. (2003). Zhang et al. (2003) studied 
the complexity of the above algorithm. The calculation of min {S'(i — 1,A) + 

l<i<j 

f{i,---,j)} depends on the block structure of the haplotypes and is the same 
as the dynamic programming algorithm for the haplotype block partitioning 
(Zhang et al., 2002b). The parametric algorithm takes 0{N -n?) time to compute 
min {S'(i— 1, A) + Ai(t, . . . , j)}, where N is the number of SNPs contained in the 

l<i<j 

largest block. However, if L(-) is an additive function, the time can be reduced 
to 0{N ■ n) time (Waterman 1995; Waterman et al., 1992). Thus, the total time 
for finding S{n, A) is 0{K"^ ■ ■ n + N ■ S ■ n), where K is the total number 

of haplotype samples, and S is the number of segments in S{n, A), which is less 
than the total number of tag SNPs. 

After finding all the line segments of S'(n, A), we know the entire function 
of S{n,X). At each intersection point (x, S'(n, x)), several block partitions with 
different numbers of tag SNPs and lengths of excluded intervals may have the 
same score. We choose the right-most one with the maximum number of tag 
SNPs and the minimum length of excluded intervals. For each line segment 
between two intersection points, both the total number of tag SNPs and the 
total length of excluded intervals are constant along this segment, and both 
are equivalent for the intersection point between this segment and the previous 
segment. We can sort the number of tag SNPs by an ascending order according 
to the deletion parameters at the intersection points. The gaps between these 
numbers give us information as to how the block partition is affected by the 
deletion parameter A. 

The above scoring scheme using L(-) treats each SNP or each chromosome 
segment as equally important. Functional SNPs play a crucial role in associa- 
tion studies because they directly code proteins. Thus, investigators may try to 
utilize as much as possible coding regions instead of non-coding regions. Zhang 
et al. (2003) proposed weighted schemes for defining function T(-), in which a 
higher penalty was added to the SNPs in coding regions, to accommodate this 
information. For example, they gave an example of L(-) as follows: 

L{i, ...,j)=j-i + l-ric + T*nc 

where ric is the number of SNPs in coding regions and T is a relatively large 
positive number. 



4 Haplotype Block Partitioning with Genotype Data 
from Unrelated Individnals 

The above methods (Zhang et al., 2002b; Zhang et al., 2003), along with other 
methods (Patil et al., 2001; Wang et al., 2002) were initially developed based 
on haplotype data. Although laboratory techniques, such as allele-specific long- 
range PCR (MichalatosBeloin et al., 1996) or diploid-to-haploid conversion (Dou- 
glas et al., 2001), have been used to determine haplotypes in diploid individuals. 




104 K. Zhang et al. 



these approaches are technologically demanding and expensive, which makes 
it extremely difficult to do a large scale study across the whole genome such 
as done by Patil et al. (2001). For these reasons, multiple large-scale genotype 
data rather than haplotype data are being generated. It is necessary to develop 
methods to extract block information from genotype data directly. 

In all the three dynamic programming algorithms (Zhang et al., 2002b; Zhang 
et ah, 2003) we studied, the boundary of blocks and the number of tag SNPs de- 
pend on three functions for a set of consecutive SNPs, Si, . . . , Sj \ block{i , . . . , j), 
/(z, ■ ■ ■ ,j) and L{i, . . . ,j). block{-) and /(•) can be efficiently computed from a 
set of haplotypes and their frequencies. Thus, we define the haplotype block par- 
titioning problem based on genotype data from unrelated individuals as follows. 

Suppose we are given genotypes of a group of individuals. For a consecu- 
tive set of SNPs Si, Si+i , . . . ,Sj, we can first estimate the haplotype frequencies 
and then infer the two haplotypes of each individual. To calculate S{j){j = 
1,2, ...,n), the haplotype frequency and the haplotype pairs carried by each 
individual are estimated for a set of a consecutive set of SNPs: Si, s^+i, . . . , Sj, in 
which z is set equal to j initially. The function, block{i, ■ ■ ■ ,j) and /(z, ■ ■ ■ ,j) are 
then computed based on these estimates. If this set of SNPs can form an block, 
then we extend our inference of haplotypes and their frequencies to the set of 
SNPs, Si- 1 , Si, , Sj, and compute llock{-) and /(•). Otherwise, we repeat the 
above procedure to compute S{j + 1) until S{n) is obtained. 

Many methods are available to infer haplotypes and their frequencies based 
on genotypes of unrelated individuals. They can be divided into those based on 
combinatorics Clark 1990; Gusfield 2001) and those based on statistics (Excoffier 
and Slatkin, 1995; Hawley and Kidd, 1995; Lin et al., 2002; Niu et al., 2002; Qin 
et al., 2002; Stephens et al., 2001). Combinatorics based methods assign the 
two haplotypes of an individual first and then the frequencies of haplotypes are 
estimated based on the assigned haplotypes. The statistical methods first es- 
timate the frequencies of haplotypes and then two haplotypes are assigned to 
each individual according to the likelihood function. Although there are still 
debates about the optimal methods for inferring haplotype frequencies and re- 
constructing two haplotypes for each individual, we incorporate the Partition- 
Ligation-Expectation-Maximization (PL-EM) algorithm for haplotype inference 
(Qin et al., 2002) to our dynamic programming algorithms. We infer haplotypes 
and their frequencies for each consecutive set of SNPs that can form a potential 
block, rather than infer them for the whole set of SNPs. The haplotype infer- 
ence program would be executed many times in a single round. In the PL-EM 
algorithm, all of the SNP loci are broken down into ’’atomistic” units that only 
contain several SNPs (usually 5-8 SNPs) and have one or two common SNPs 
with adjacent units. The EM algorithm is first employed within each unit to 
infer the frequencies of haplotypes and the haplotypes for each individual, then 
applied to ’’ligate” two adjacent partial haplotypes together. In general, EM al- 
gorithm is efficient in time and space for a small number of SNP markers. Thus, 
this strategy can solve the speed and memory constraints that generally exist in 




Dynamic Programming Algorithms for Haplotype Block Partitioning 105 



the EM algorithm and makes it suitable for large-scale recovery of haplotypes 
from genotype data. 

To recover the haplotypes from genotype data in a large scale, Eskin et al. 
(2003) combined a local haplotype prediction algorithm and a dynamic pro- 
gramming algorithm together to determine the block boundaries directly from 
the genotype data. In their local haplotype prediction algorithm, they deter- 
mined a set of possible haplotype that appear in samples based on imperfect 
phylogeny, in which the number of haplotypes is much less than the number of 
haplotypes that are compatible with the genotype of samples. This makes it pos- 
sible to estimate the frequency of these haplotypes using the EM algorithm for 
a relative large number of SNPs. However, it is not clear if their local haplotype 
prediction algorithm could be extended to predict haplotypes for larger number 
of SNPs, especially for more than 100 SNPs. In this situation, the decreased 
number of haplotypes that are compatible with imperfect phylogeny could be 
too large to be handled by the EM algorithm. In the study of Patil et al. (2001), 
the largest block contains more than 100 SNPs. Actually, based on the same 
data and parameter setting, there are six blocks with more than 100 SNPs and 
11 blocks with more than 80 SNPs in blocks identified by Zhang et al. (2002b). 
Our current implementation of PL-EM algorithm can predict haplotypes and 
their frequencies from about 100 samples for up to 250 SNPs. This kind of scale 
should be large enough for most studies. 

5 Haplotype Block Partitioning with Genotype Data 
from General Pedigrees 

Although many methods have been developed for estimating haplotype frequen- 
cies using unrelated individuals, pedigree data are routinely collected in genetic 
studies. The genetic information from relatives in a general pedigree can help 
us resolve haplotype ambiguity. Even if ambiguities still exist with data from a 
very large pedigree, the reduction of haplotype ambiguity can help us improve 
the efficiency for estimating haplotype frequencies (Becker and Knapp, 2003). 
Thus, genotype data from general pedigrees combined with that from unrelated 
individuals can be used to improve the accuracy of the estimates for haplotype 
frequency and to detect haplotype blocks more reliably. Suppose that the haplo- 
type frequencies can be estimated for each set of consecutive SNPs that form a 
potential block, the dynamic programming algorithms for haplotype block par- 
titioning using genotype data from general pedigrees are essentially the same 
with algorithms outlined in the previous section. Thus, we focus on deriving an 
EM algorithm for estimating haplotype frequencies from general pedigrees in the 
rest of this section. 

We make two assumptions. First, we assume Hardy- Weinberg equilibrium 
(HWE) for haplotypes carried by each individual. This is the fundamental as- 
sumption of the EM algorithm for haplotype frequency estimations (Excoffier et 
al., 1995; Hawley and Kidd, 1995). Second, we assume that there is no recom- 
bination event within a family for a set of consecutive SNPs within the family. 
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Si, Sj. The rationale behind this assumption is that there is high LD and rel- 
atively low recombination in each block. Under this assumption, each haplotype 
is recoded as an allele at a single multiallelic locus. The multi-locus genotypes 
can be represented as single-locus genotypes using the recoded alleles. Laws of 
Mendelian inheritance (O’Connell 2000) are assumed to be hold. 

We introduce the following notation in our computations. Suppose that there 
are total of K haplotypes comprised of SNPs Sj, . . . , sj: IH = {hi,h 2 , ■ ■ ■ , Hk}- 
Their frequencies in the population are represented as 0 = {9i,92, ■ ■ ■ ,0 k}- 
Assume that we have a total of F families. We define an individual in a pedigree 
as a non- founder if both parents are available and others are referred as founders. 
In a family / (1 < / < A) , let n/ be the number of founders and m/ be 
the total number of individuals in the family. The n/ founders are indexed as 
1, ... ,Uf and the mf — Uf non- founder individuals are indexed asn/-|-l,...,TO/. 
The genotype and two haplotypes of individual r in family / are denoted as 
Gf,, and Hf^ = Zi/,,, 2 }, respectively. It is likely that many haplotype 

pairs are compatible with the genotypes of families under the assumption of 
no recombination events. Let S/,, denote all the compatible haplotype pairs for 
individual r in family /. It is worth noting that the unrelated individuals can 
be included into our model. Each individual forms a family and is the only 
founder member in this family. In this case, S contains all possible haplotype 
pairs compatible with the genotype of this individual. 

The objective is to estimate the haplotype frequencies based on the genotypes 
of the families. For a family /, the likelihood of the genotypes of the family is: 



L/(G/,,...,G/ |0) = E ••• E Pr(G/,,...,G/ ..,/// |0) 

= E ••• E Pr{GK,...,Gf^^\HK,...,Hf^^,0) 

n j: my 

= E ••• E UPr(HfAO) n 

SSj:„,^r=l r'=nj + l 

where Pr{Hf^\0) = 29f^^\9f^^2 if and Pr{Hf^\0) = 

if ^/r-,1 = ^/r -,2 for r = l,...,n/ under the assumption of HWE; and 
Pr{H ^ (r' = uf + l,...,m/) is the gamete transmission proba- 
bilities for unordered genotypes where Plf , and are the haplotype pairs for 
the father and mother (Elston and Stewart, 1971), respectively. The likelihood 
of all the data is obtained by multiplying . . . , Gf^^ \0) across all the 

families. We then estimate 9 using the maximum likelihood estimation (MLE) 
approach. It is difficult to directly obtain the MLE of 0. EM algorithms are 
frequently used to estimate 0. 
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Suppose that we know that O = and we want to estimate In 

the E-step, we introduce the following notation. Let 

= {#/i/ in Hf^, . . . 

which is the number of haplotype I present in the founders of family f {I = 
Let 

rif 

’■=1 r' = n^ + l 

which is the weighted number of haplotype I appeared in the founders of family 

/. We define an normalization constant satisfying = 2n/ to 

1=1 ’ 

evade the calculation of Mendelian likelihood of the whole family. In the M-step, 
we can estimate from all families as follows: 

^P+i) ^ cf )yy/(2 Y^uf) {1 = 1,2,..., K). 

/=i /=i 

The EM-based method of estimating haplotype frequencies can be very com- 
putationally intensive, since the likelihood for each configuration of founder hap- 
lotype pairs is computed during each iteration. Even if a large pedigree is very 
helpful to eliminate the number of compatible haplotype configurations to re- 
duce the computation, this number could still be too large to make the com- 
putation practical, especially for a large number of SNP loci in the presence of 
missing data. In general, the genotype elimination method (Lange and Goradia, 
1987) can be used to accelerate the likelihood evaluation (O’Connell 2000) in 
the first step. However, the genotype elimination algorithm itself can be very 
time consuming too. We propose to use a rule-based method to partially assign 
the haplotypes to each individual (Wijsman 1987) first and then perform geno- 
type elimination. Several other techniques, such as set recoding (OConnell and 
Weeks, 1995) and the partition-ligation technique (Qin et ah, 2002) can further 
reduce the complexity in performing genotype elimination and EM algorithm. 

6 Discussion 

Several recent studies have suggested that the human genome can be divided 
into blocks with high LD within each block. Due to this feature, a relative small 
fraction of SNPs can capture most of the haplotypes in each block. In this paper, 
we review a set of dynamic programming algorithms for haplotype block parti- 
tioning and tag SNP selection (Zhang et ah, 2002b; Zhang et al., 2003). These 
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algorithms guarantee to find the blocks with minimum number of tag SNPs. 
Thus, genotyping efforts can be reduced as much as possible. These algorithms 
also provide a general framework to accommodate other block definitions and 
criteria for haplotype block partitioning and tag SNP selection. For example, 
Schwartz et al. (Schwartz et al., 2003) examined the robustness of inference of 
haplotype block structure for three methods based on two optimal criteria: mini- 
mizing the total number of blocks and minimizing the total number of tag SNPs. 
Both approaches were achieved by the dynamic programming algorithm (Zhang 
et al., 2002b). 

Most methods for block partitioning and tag SNP selection were developed 
primarily based on haplotype data. Recently, block-detection methods based on 
genotype data have been developed (Eskin et al., 2003; Greenspan and Geiger, 
2003). In this paper, we discuss extensions of dynamic programming algorithms 
to genotype data from unrelated individuals as well as genotype data from gen- 
eral pedigrees. The most difficult part in this step is to estimate the haplotype 
frequency and two haplotypes carried by each individual efficiently. Although 
many methods have been developed in this area, the haplotype inference and 
analysis from large pedigrees still remain a challenging problem. Furthermore, 
with the accumulation of different types of data, such as case-control data, sib- 
ship data, pedigree data, and pooled DNA genotype data (Sham et al., 2002), 
new tools for haplotype inference from these combined data need to be devel- 
oped. 

The extent of LD varies among different populations, and thus haplotype 
block structure also show substantial variation, especially among African and 
other populations (Gabirel et al., 2002). However, the available methods are 
all based on the assumption that the population under study is homogeneous. 
A possible way around this problem is to first use unrelated SNPs to divide 
a general population into several homogeneous populations (Pritchard et al., 
1999), and then obtain the haplotype block partitions and the tag SNPs for each 
population. On the other hand, the current HapMap will be constructed based 
on several major populations and applied to the other unstudied populations. 
Obviously, this may be problematic and more data are needed to evaluate the 
validity of such generalizations. 

Many methods for haplotype block detection have been developed. The hap- 
lotype block structures and the tag SNPs identified in each study depend heavily 
on the method used. Even using same definition of haplotype blocks and tag 
SNPs, the boundary of haplotype blocks and the tag SNPs in each block iden- 
tified by the dynamic programming algorithm and other methods may not be 
unique, making it very difficult to compare them. Undoubtedly, this comparison 
along with the biological relevance of haplotype blocks, such as their relationship 
with recombination hot spots, genetic drift, population history and other factors 
(Jeffreys et al, 2001; Phillips et al., 2003), are of great interest. Our interest in 
the haplotype block is for its potential utility in association studies: to reduce 
the genotyping effort and preserve high power in mapping disease genes respon- 
sible for human complex diseases. Under this criterion, different methods can be 
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compared. Zhang et al. (2002a) compared the power of different tests using tag 
SNPs and all SNPs by extensive simulations. They found that the power using 
20% tag SNPs is only reduced less than 10% under certain conditions. However, 
they did not extend it to other tag SNP identification methods. Obviously, the 
assessment of power using tag SNPs conducted from different methods would be 
of great interest and extremely helpful for association studies. 

One of the most important implications of haplotype blocks is that the geno- 
typing effort can be largely reduced without much loss of power using only tag 
SNPs. Markers within the same block generally show a strong LD pattern. This 
advantage, along with the relatively smaller number of haplotypes defined by 
tag SNPs in each block provides a possible way to resolve the complexity of 
haplotypes. Thus, we can use each block as a unit in association studies (Daly et 
al., 2001). However, haplotype block boundaries are not unique and substantial 
LD can be found between loci in different blocks (Gabriel et al.,2002). Therefore, 
the effectiveness of using each block as a unit to study the association between 
complex traits and candidate regions needs to be further investigated. 

Acknowledgements 

This research was partially supported by National Institutes of Health Grant 
DK53392, National Institutes of Health Grant 1 R01-RR16522-01, National In- 
stitutes of Health Grant R01-HG02518-01, National Science Foundation EIA- 
0112934, and the University of Southern Galifornia. 

References 

[2003]Becker, T., Knapp, M.: Efficiency of haplotype frequency estimation when nuclear 
familiy information is included. Hum. Hered. 54 (2003) 45-53 
[1990] Clark, A.G.: Inference of haplotypes from PCR-amplified samples of diploid pop- 
ulations. Mol. Biol. Evol. 7 (1990) 111-112 

[2001]Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., Lander, E.S.: High- 
resolution haplotype structure in the human genome. Nat. Genet. 29 (2001) 229-232 
[2002]Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., 
Pabilal, J., Dibling, T., Tinsley, E., Kirby, S., Carter, D., Papaspyidonos, M., Liv- 
ingstone, S., Ganske, R., Lohmmussaar, E., Zernant, J., Tonisson, N., Remm, M., 
Mgi, R., Puurand, T., Vilo, J., Kurg, A., Rice, K., Deloukas, P., Mott, R., Metspalu, 
A., Bentley, D.R., Gardon, L.R,, Dunham, L: A first-generation linkage disequilib- 
rium map of human chromosome 22. Nature 418 (2002) 544-548 
[2001] Douglas, J.A., Boehnke, M., Gillanders, E., Trent, J.M., Gruber, S.B.: 
Experimentally-derived haplotypes substantially increase the efficiency of linkage 
disequilibrium studies. Nat. Genet. 28 (2001) 361-364 
[2000] Dunning, A.M., Durocher, F., Healey, C.S., Teare, M.D., McBride, S.E., Gar- 
lomoagno, F., Xu, C.F., Dawnson, E., Rhodes, S., Ueda, S., Lai, E., Luben, R.N., 
Van Rensburg, E.J., Mannermaa, A., Kataja, V., Rennart, G., Dunham, L, Purvis, 
L, Easton, D., Ponder, B.A.J.: The extent of linkage disequilibrium in four popula- 
tions with distinct demographic histories. Am. J. Hum. Genet. 67 (2000) 1544-1554 




110 K. Zhang et al. 



[2001]Eisenbarth, L, Striebel, A.M., Moschgath, E., Vodel, W., Assum, G.: Long-range 
sequence composition mirrors linkage diseqnilibrinm pattern in 1 1.13 MB region 
of human chromosome 22. Hum. Mol. Genet. 24 (2001) 2833-2839 
[1971]Elston, R.G., Stewart, J.: A general model for the genetic analysis of pedigree 
data. Hum. Hered. 21 (1971) 523-542 

[2003]Eskin, E., Halperin, E., Eskin, E.: Large scale recovery of haplotypes from geno- 
type data using imperfect phylogeny. In Miller W, Vingron M, Sorin I, Pevzner P, 
Waterman M (eds), Proceedings of the Seventh Annual International Gonference 
on Research in Computational Molecular Biology (RECOMB 2003). ACM, New 
York. (2003) 104-113 

[1995]Excofiier, L., Slatkin, M.: Maximum-likelihood estimation of molecular haplotype 
frequencies in a diploid population. Mol. Biol. Evol. 12 (1995) 921-927 
[2002] Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., 
Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., 
Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Altshuler, D.; The structure of 
haplotype blocks in the human genome. Science 296 (2002) 2225-2229 
[1979]Garey, M.R., Johnson, D.S.: Computers and Intractability. Freeman, New York. 
(1979) 222 

[2003] Greenspan, G., Geiger, D.: Model-based inference of haplotype block variation. 
In Miller W, Vingron M, Sorin I, Pevzner P, Waterman M (eds), Proceedings of the 
Seventh Annual International Conference on Research in Computational Molecular 
Biology (RECOMB 2003). ACM, New York. (2003) 131-137 
[2001]Gusfield, D.: Inference of haplotypes from samples of diploid populations: Com- 
plexity and algorithms. J. Comp. Biol. 8 (2001) 305-323 
[1994]Gusfield, D., Balasubramanian, K., Naor, D.: Parametric optimization of se- 
quence alignment. Algorithmica 12 1994 312-326 
[1995]Hawley, M.E., Kidd, K.K.: HAPLO: a program using the EM algorithm to esti- 
mate the frequencies of multi-site haplotypes. J. Hered. 86 (1995) 409-411 
[2001] Jeffreys, A.J., Kauppi, L., Neumann, R.; Intensely punctate meiotic recombina- 
tion in the class II region of the major histocompatibility complex. Nat. Genet. 29 
(2001) 217-222 

[2001] Johnson, G.C.L., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Genova, 
G.D., Ueda, H., Cordell, H.J., Eaves, LA., Dudbridge, F., Twells, R.C.J., Payne, F., 
Hughes, W., Nutland, S., Stevens, H., Carr, P., Tuomilehto-Wolf, E., Tuomilehto, J., 
Gough, S.C.L., Clayton, D.G., Todd, J.A.: Haploype tagging for the identihcation 
of common disease genes. Nat. Genet. 29 (2001) 233-237 
[1999]Kruglyak, L.: Prospects for whole-genome linkage disequilibrium mapping of com- 
mon disease genes. Nat. Genet. 22 (1999) 139-144 
[1987]Lange, K., Goradia, T.M.: An algorithm for automatic genotype elimination. Am. 
J. Hum. Genet. 40 (1987) 250-256 

[2002]Lin, S., Cutler, D.J., Zwick, M.E., Chakravarti, A.: Haplotype inference in ran- 
dom population samples. Am. J. Hum. Genet. 71 (2002) 1129-1137 
[1995]Long, J.C., Williams, R.C., Urbanek, M.: An E-M algorithm and testing strategy 
for mutiple-locus haplotypes. Am. J. Hum. Genet. 56 (1995) 799-810 
[1996]MichalatosBeloin, S., Tishkoff, S.A., Bentley, K.L., Kidd, K.K., Ruano, G.: 
Molecular haplotyping of genetic markers 10 kb apart by allelic-specihc long-range 
PGR. Nucleic. Acids Res. (24 (1996) 4841-4843 
[2002]Niu, T., Qin, Z., Xu, X., Liu, J.S.: Bayesian haplotype inference for multiple 
linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70 (2002) 157-159 
[2002]Nordborg, M., Tavare, S.: Linkage disequilibrium: what history has to tell us. 
Trends Genet. 18 (2002) 83-90 




Dynamic Programming Algorithms for Haplotype Block Partitioning 111 



[2000] O’Connell, J.R.: Zero-Recombinant haplotyping: applications to fine mapping 
using SNPs. Genet. Epidem. 19(Suppl 1) (2000) S64-S70 

[1995] O’Connell, J.R., Weeks, D.E.: The VITESSE algorithm for rapid exact multilocus 
linkage analysis via genotype set-recoding and fuzzy inheritance. Nat. Genet. 11 
(1995) 402-408 

[2001]Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., 
Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., Nguyen, B.T.N., 
Norris, M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P., Thomas, D.J., 
Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P.A., Cox, D.R.: Blocks of limited 
haplotype diversity revealed by high-resolution scanning of human chromosome 21. 
Science 294 (2001) 1719-1723 

[2003] Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., 
Donaldson, M.A., Studebaker, J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.S., Camisa, 
A.L., Pazorov, V., Scott, K.E., Carey, B.J., Faith, J., Katari, G., NBhatti, H.A., 
Cyr, J.M., Derohannessian, V., Elosua, C., Forman, A.M., Grecco, N.M., Hock, 
C.R., Kuebler, J.M., Lathrop, J.A., Mockler, M.A., Nachtman, E.P., Restine, S.L., 
Varde, S.A., Hozza, M.J., Gelfand, C.A., Broxholme, J., Abecasis, G.R., Boyce- 
Jacino, M.T., Cardon, L.R.: Chromosome-wide distribution of haplotype blokes 
and the role of recombination hot spots. Nat. Genet. 33 (2003) 382-387 

[1999]Pritchard, J.K., Rosenberg, N.A.: Use of unlinked genetic markers to detect pop- 
ulation stratification in association studies. Am. J. Hum. Genet. 65 (1999) 220-228 

[2002]Qin, Z., Niu, T., Liu, J.: Partitioning-Ligation-Expectation-Maximization Algo- 
rithm for haplotype inference with single-nucleotide Ploymorphisms. Am. J. Hum. 
Genet. 71 (2002) 1242-1247 

[2002]Reich, D.E., Schaffner, S.F., Daly, M.J., MeVean, G., Mullikin, J.G., Higgins, 
J.M., Richter, D.J., Lander, E.S., Altshuler, D.: Human genome sequence variation 
and the influence of gene history, mutation and recombination. Nat. Genet. 32 
(2002) 135-142 

[1996]Risch, N., Merikangas, K.: The future of genetic studies of complex human dis- 
eases. Science 273 (1996) 1516-1517 

[2003] Schwartz, R., Halldorsson, B.V., Bafna, V., Glark, A.G., Sorin, L: Robustness of 
inference of haplotype block structure. J. Gomput. Biol. 10 (2003) 13-19 

[2002]Sham, P., Bader, J.S., Graig, L, O’Donovan, M., Owen, M.: DNA pooling: a tool 
for larger-scale association studies. Nat. Rev. Genet. 3 (2002) 862-871 

[2001] Stephens, M., Smith, N.J., Donnelly, P.: A new statistical method for haplotype 
reconstruction from population data. Am. J. Hum. Genet. 68 (2001) 978-989 

[2000]Taillon-Miller, P., Bauer-Sardina, L, Saccone, N.L., Putzel, J., Laitinen, T., Gao, 
A., Kere, J., Pilia, G., Rice, J.P., Kwork, P.Y.: Juxtaposed regions of extensive and 
minimal linkage disequilibrium in human Xq25 and Xq28. Nat. Genet. 25 (2000) 
324-328 

[2002]Wang, N., Akey, J.M., Zhang, K., Chakraborty, K., Jin, L.: (2002) Distribution 
of recombination crossovers and the origin of haplotype blocks: The interplay of 
population history, recombination, and mutation. Am. J. Hum. Genet. 71 (2000) 
1227-1234 

[1998]Wang, D.G., Fan, J.B., Siao, G.J., Berno, A., Young, P., Sapolsky, R., Ghan- 
dour, G., Perkins, N., Winchester, E., Spencer, J., Kruglyak, L., Stein, L., Hsie, 
L., Topaloglou, T., Hubbell, E., Robinson, E., Mittmann, M., Morris, M.S., Shen, 
N.P., Kilburn, D., Rioux, J., Nusbaum, G., Rozen, S., Hudson, T.J., Lipshutz, 
R., Ghee, M., Lander, E.S.: Large-scale identification, mapping and genotyping of 
single-nucleotide polymorphisms in the human genome. Science 280 (1998) 1077- 
1082 




112 K. Zhang et al. 



[1995] Waterman, M.S.: Introduction to compntational biology: maps, sequences and 
genomes. Chapman & Hall/CRC Press, Boca Raton FL. (1995) 

[1992]Waterman, M.S., Eggert, M., Lander, E.L. (1992) Parametric sequence compar- 
isons. Proc. Natl. Acad. Sci. USA. 89 (1992) 6090-6093 
[2002]Weiss, K.M., Clark, A.G.: Linkage disequilibrium and the mapping of complex 
human traits. Trends Genet. 18 (2002) 19-24 
[1987] Wijsman, E.M.: A dedeuctive method of haplotype analysis in pedigrees. Am. J. 
Hum. Genet. 51 (1987) 356-373 

[2002a] Zhang, K., Calabrese, P., Nordborg, M., Sun, F.: Haplotype structure and its 
applications to association studies: power and study design. Am. J. Hum. Genet. 
71 (2002a) 1386-1394 

[2002b]Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F.: A dynamic program- 
ming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA. 95 
(2002b) 7335-7339 

[2003]Zhang, K., Sun, F., Waterman, M.S., Chen, T.: Haplotype Block Partition with 
Limited Resources and Applications to Human Chromosome 21 Haplotype Data. 
Am. J. Hum. Genet. 73 (2003) 63-73 




Parametric Bootstrap for Assessment of 
Goodness of Fit of Models for 
Block Haplotype Structure 



Maoxia Zheng^* and Mary Sara McPeek^’^ 

^ Department of Statistics 
University of Chicago 
5734 S. University Ave. 

Chicago, IL 60637 
zhengOgalton. uchicago . edu 
Department of Human Genetics, 
University of Chicago, Chicago, IL 



Abstract. Our ultimate interest is in formalization of models for high- 
resolution haplotype structure in such a way that they can be useful in 
statistical methods for linkage disequilibrium (LD) mapping. Some steps 
in that direction have been taken by Daly et al.(2001) who describe 
a block structure of haplotypes. They outline a hidden Markov model 
(HMM) that allows for common haplotypes in each block. We propose 
somewhat different models that also use HMM, which allow for haplo- 
types in a block that are not one of the common types and more complex 
graph structure of preferred and non-preferred transitions between hap- 
lotypes in adjacent blocks. In this paper, we also address the problem 
of assessing goodness of fit of such models to data when only unphased 
genotype data on individuals or trios are available. We find that the 
traditional parametric bootstrap method to assess the goodness of fit of 
models does not have the right type I error in our case, where we have 
multinomial-type models with many cells having very low probabilities 
and only a moderate sample size. 



Introduction 

Two fundamental types of statistical problems arise in LD mapping. One is 
hypothesis testing to detect LD of a trait to a region. It is relatively easy to 
propose a variety of valid hypothesis tests to solve this problem (with multiple 
comparisons being the main difficult issue). The other fundamental statistical 
problem is the localization problem, which is to construct a confidence interval 
(Cl) for the suspected locus. Most methods for testing for LD are not appropri- 
ate for construction of CL This is because most methods use different amounts 
of information for the hypothesis tests at different loci, so that hypothesis test- 
ing results at different points are not formally comparable. Methods that are 
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appropriate for Cl construction include those that provide a multilocus likeli- 
hood for the entire set of data, evaluated at each locus, e.g. the methods of 
McPeek and Strahs (1999), Morris et al. (2000), Liu et al. (2001) and others. 
Such methods typically require, and are sensitive to, a model for background 
LD in the region. In practice, linkage equilibrium or a Markov model of order 1 
on nucleotides have been used, but such models are not adequate to capture the 
structure of background LD (Strahs and McPeek (2003)). What is needed is a 
relatively parsimonious class of models that adequately captures this structure 
and that is computationally feasible for use in LD mapping. We are willing to 
consider purely descriptive models (e.g. Markov models) for background LD in 
a population in addition to considering more explicit population-genetics based 
models. 

Daly et al. (2001) consider a class of hidden Markov models in which the time 
index of the hidden Markov chain is nucleotide, and the state space consists of 
the common haplotypes in each block. They force each observed haplotype to 
either match one of the common haplotypes in that block or to result from a 
combination of random genotyping errors and (rare) intra-block recombination 
events applied to common haplotypes, with these random genotyping errors and 
intra-block recombination events assumed to be independent from haplotype to 
haplotype. Thus, they have up to four common haplotypes in each block, with 
all others assumed to arise independently as rare variants on these. They define 
four ancestral haplotypes through the entire region, and the transitions from one 
common ancestral haplotype at one nucleotide to a different ancestral haplotype 
at the next nucleotide are defined to be “historical recombinations.” When a 
historical recombination occurs, they assume that it is equally likely to go to 
any other ancestral haplotype. The initial probabilities at the first nucleotide 
are chosen to be the population frequencies of the ancestral haplotypes. We note 
that this Markov model is not stationary, so the joint probability model differs 
depending on whether one starts the Markov model at the 3' end of the region 
and runs the “time” index toward the 5' end or vice versa. We build on some of 
the ideas in Daly et al. (2001) and devise a rather different class of models. 

To assess the goodness of fit of models, if we assume that haplotypes follow 
some type of multinomial distribution, classical approaches would be Pearson’s 

test if we have complete data and likelihood ratio between fitted model 
and unconstrained multinomial model when we have incomplete data. However, 
there are several difficulties when we consider the haplotypes extending across 
all blocks. We have too many cells with very low counts (i.e. haplotypes with low 
frequencies) for x^ approximations to work. This is exacerbated by incomplete 
data when we take into account uncertainty in missing haplotypes. Also, when 
fitting the unconstrained multinomial model, in the case when we have only 
genotype data, it is computationally prohibitive because we need to consider all 
possible haplotypes compatible with the data, where haplotypes extend across 
all blocks. We instead consider a goodness-of-fit test statistic that is a likelihood 
ratio between the fitted model of interest and the best-fitting model in a larger 
class of Markov models, and we use a Monte Carlo procedure, parametric boot- 
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strap, to assess the significance. We investigate the properties of this method as 
applied to our problem. 



Methods 

Hidden Markov Methods 

We use hidden Markov models that differ in several ways from those used 
in Daly et al.(2001). First, the time index of the Markov chain is block, rather 
than nucleotide. Second, the state space consists of both common and uncommon 
haplotypes in each block, i.e. we do not force each observed haplotype either to 
be one of the common haplotypes or to arise independently as a recombinant or 
genotyping error applied to common haplotypes. Thus, we effectively allow for 
additional haplotypes that are neither extremely common nor extremely rare. 
As a result, for each block, there are different numbers of states. We allow for 
historical recombination between identical haplotypes as well as non-identical 
ones, and we allow a more complex graph structure of what we call “preferred” 
and “non-preferred” transitions (in practice, these are chosen by comparing the 
observed marginal frequencies of haplotypes in a block and conditional frequen- 
cies of haplotypes in the block given haplotype in an adjacent block). According 
to the structure, we can define the transition probabilities for the hidden Markov 
chain. 

We first describe our hidden Markov chain for the case in which only common 
haplotypes are allowed and later we describe the extension to allow uncommon 
haplotypes. Let {Xt\ denote the hidden Markov chain corresponding to one of an 
individual’s two copies of a chromosomal region, where t is indexed by block. Xt 
takes values in the state space St, where St is the set of all distinct haplotypes in 
block t. Let VFi, W 2 , • • • , Wki be the distinct haplotypes in block t (i.e. elements 
of St), and V\,V2, - ■ ■ , Vk2 be the distinct haplotypes in block t-l-1 (i.e. elements 
of 5't+i). In the graph structure, an edge between a given element of St and a 
given element of 5't+i represents a preferred transition, and the absence of an 
edge represents a non-preferred transition. We consider three types of preferred 
transitions between adjacent blocks, as follows: 

1. “Peer”:Wi has only one edge and this is with a Vj that has no other edge 
(Fig 1). In this case, we require freq(Wi) = freq(Vj) and 

Prob(Xt+i = Vj\Xt = W^) = (1 - Ot) + 6»tfreq(Vj) 



Block t 



Block t+1 



Wi 



Vj 



Fig. 1. “Peer” relationship between Wi in block t and Vj in block t-l-1 . 
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where 9t is the recombination rate between block t and block t+1. Note that 
the joint probability is the same when the model is applied in the reverse 
direction, i.e. 

Prob(Xt+i = Vj,Xt = W^) = Prob(Xt = W,)Vroh{Xt+i = Vj\Xt = W,) 

= freq(Wi)[(l - 0t) + 6»tfreq(Pj)] 

= Prob(Xt+i = l^)Prob(Xt = Wi\Xt+i = Vj) 
= freq(Pj)[(l - 9t) + 6»tfreq(Wi)] 



2. “Parent-to-child” : Wi has edges with multiple P’s and for every Vj having 
edge with Wi, Vj has only that edge. Vj is called the “child” of Wi, and Wi 
is the “parent” of Vj (Fig 2). 



Block t Block t+1 

VI 

V2 

V3 

Fig. 2. “Parent-to-child” relationship between Wi in block t and multiple P’s in 
block t -I- 1 . 




In this case, we require freq(VP) = J2k-.Vk is a child of Wi freq(Pfe) and 
Prob(X,+i = V,\Xt = IP,) = (1 - + 0tfreq(P,) 

3. “Child-to-parent” : Similar to previous case, but with Vj having edges with 
multiple IP’s, and for every IP^ having edge with Vj, Wi has only that edge. 
Then we have freq(Pj) = Ek-.w^ is a child of v, freq(IPfe) and 

Prob(Xt+i = Vj\Xt = Wi) = (1 - 9t) + 9tfreq{Vj). 

As for case 1, note that the joint probability is the same whether the model 
is applied in the parent-to-child or child-to-parent direction. 

Suppose the blocks are labeled t = 1,. . . ,L. In the graph structure for com- 
mon haplotypes, we require that each common haplotype in St have an edge 
with at least one element of St-i (for t > 1) and at least one element of 
(for t < L). Furthermore, we require that every element of St fit into one of the 
above three cases with regard to preferred transitions to elements of ^t+i.IfP, 
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and Wi do not have an edge between them in the preferred-transition graph, 
then we use Prob(ATi_|_i = Vj\Xt = Wi) = 6tireq{Vj). 

We now indicate how to add uncommon haplotypes to the model. We as- 
sume that within each block, the frequencies of uncommon haplotypes sum to 
a parameter o, 0 < o < 1, where o is constant across blocks. We do not allow 
uncommon haplotypes to have any edges in the preferred transition graph. Then 
we calculate Prob(W-i-i = = Wi) as follows: 

1. if Wi and Vj are both common and are peers, 

(1 - o)(l - 9t) + 6»t freq(V,); 



2. if Wi and Vj are both common, and Wi is the parent of Vj, 

3. if Wi and Vj are both common, and Wi is the child of Vj, 

(1 - o)(l - 9t) + 6»tfreq(Pj); 



4. if Wi and Vj are both common and do not have an edge between them in 
the preferred-transition graph, 0tfreq(V}); 

5. if at least one of Wi and Vj is uncommon, freq(V^ ). 

We also allow another layer of hidden states, so that a given observed haplotype 
in block t may be compatible with multiple elements of the hidden state space 

Sf 

We now consider the observation distribution for the hidden Markov chain. 
First suppose that we have genotype data on individuals. For a given individual, 
let Yf denote the (unphased) genotype data in block t, with missing data permit- 
ted. The hidden Markov chain for genotype data corresponds to two indepen- 
dent copies of the Markov chain {Xt} described above, one for the maternally- 
inherited and one for the paternally-inherited strand, {AT™} and {Xf}, respec- 
tively. The observation distribution of the HMM is given by P{Yt\Xp, X[} = 1 
if the observed genotype data are compatible with maternal and paternal haplo- 
type combination (Ai™, X^) and 0 otherwise. (In principle, this could be modified 
to allow for genotyping error.) The traditional HMM approach would handle the 
missing and heterozygous sites by listing all the possible resolutions. However, in 
the computation, we simply consider all the possible resolutions for the observed 
data instead of going through the entire state space of the hidden Markov chain. 
This turns out to be much more efficient, particularly when we are dealing with 
trio data. In the case of trio data (two parents and an offspring), we consider 4 
parental haplotypes (and 2 parental genotypes) at the same time, so the number 
of states in the state space of block t will be |S't|^, where |5't| is the number of 
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distinct haplotypes in block t. However, by listing all the compatible 4-tuples of 
haplotypes for this family, the number of possible resolutions for this family is 
much smaller than 

Within a given model, parameters include the frequencies of haplotypes in 
each block (constrained by the specific graph structure of preferred transitions) 
and “historical recombination” rate parameters 6t between each pair of adjacent 
blocks. Because of the Markov structure, a proposed model can be efficiently fit 
to data via the EM procedure Dempster et al.(1977) by an obvious extension of 
the Baum (1972) algorithm for HMMs, which allows both likelihood calculation 
and maximization of likelihood over parameters. 

Parametric bootstrap for assessment of significance of goodness-of-fit statistic 
For any given data set of finite size, there will be multiple distinct models 
that fit the data, because there will be limited information to distinguish among 
models. Our goal is to distinguish, from among proposed models, those that fit 
from those that show significant misfit to a particular set of data. 

We use a Monte Carlo (parametric bootstrap) assessment of significance 
rather than a approximation. This gives us more latitude in choice of goodness- 
of-fit statistic S as well. At the moment, we take S to be the likelihood ratio be- 
tween the fitted model (call this model D, our haplotype block structure model) 
and the general Markov model on blocks with unconstrained initial frequen- 
cies and transition probabilities (call this model G) . Model G is typically vastly 
overparametrized relative to model D, so may be a reasonable surrogate for 
the unconstrained multinomial model. We can also use HMM to calculate the 
likelihood of model G efficiently. 

Our Monte Carlo procedure for assessment of significance of a goodness-of-fit 
statistic S is as follows: 

1. Fit model D, the haplotype block structure model, to the data, estimate the 
parameters (call the estimated parameter vector X^), calculate goodness-of- 
fit statistic S^. 

2. Simulate m realizations of data from model with parameters A^, keeping the 
same pattern of missing information as in the real data. For example, if one 
individual’s genotype is missing at some sites, then in the simulated data 
set, his genotype will be still missing at those sites. For the ith realization, 
estimate parameters A*, and calculate the goodness-of-fit statistic S'*, for 
i = 1, . . . ,m. 

3. Find the percentile rank of among S^,---,S”* and reject (i.e. model 
doesn’t fit) if is in the upper/lower a/2 tail of the distribution (e.g. 
a = 0.05) 



Monte Carlo procedure for assessment of type I error 

To assess the type I error of the significance test, we propose a Monte Carlo 
procedure that adds another layer of simulation to that just described: 
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1. Fit model D, the haplotype block structure model, to the original data, 
estimate the parameters (call the estimated parameter vector A^), calculate 
goodness-of-fit statistic S^. 

2. Simulate mi realizations of data from model D with parameters A^. For 
the fth realization, estimate parameters A®, and calculate the goodness-of-fit 
statistic S^, for i = 1, . . . , mi. 

3. For zth realization, simulate TO 2 new realizations of data from model D with 
parameters A*. For the ijth realization, estimate parameters A®-^, and calcu- 
late the goodness-of-fit statistic for j = 1, . . . ,m 2 - Find the percentile 
rank of 5® among • • • , 5®’”^ and reject (i.e. model doesn’t fit) if 5® is in 
the upper/lower a/2 tail of the distribution (e.g. a = 0.05). 

4. Calculate the type I error as the proportion of realizations out of mi for 
which the null hypothesis is rejected. 



Results 

We consider the data set of Daly et al. (2001). It is a 500kb region from chromo- 
some 5q31 which may contain a genetic risk factor for Crohn’s disease. There are 
103 common SNPs genotyped in 129 trios from a European-derived population. 
We use more or less the same choices of blocks as in Daly et al.(2001), except that 
we make the modification suggested on their website. From the phased haplotype 
data generously provided by Mark Daly, we get the distinct haplotypes (common 
haplotypes and uncommon haplotypes) in each block. There are 11 blocks and 
the numbers of distinct haplotypes are 18, 11, 39, 31, 15, 9, 77, 31, 22, 28, 10, 
from block 1 to block 11 respectively. We add some haplotypes to the state space 
to ensure that each family has at least one possible resolution of haplotypes that 
is compatible with their genotype data. By comparing the marginal counts of 
each haplotype to the conditional counts given each haplotype in an adjacent 
block, we propose various choices of the preferred and non-preferred transitions 
between haplotypes in adjacent blocks. Fig 3 is one of the graph structures we 
consider, where the edges denote different preferred transitions across the blocks 
among common haplotypes. 




Fig. 3. One of the graph structures for common haplotypes we consider for the 
data set of Daly et al. (2001), where different colored edges represent different 
preferred transitions across blocks. 
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After introducing some hidden states to some blocks (letters in the boxes 
in Fig 4) and incorporating the uncommon haplotypes in each block (Oij’s in 
Fig 4), we get a new picture where the preferred transitions occur along parallel 
lines, i.e. all preferred transitions are of the “peer” type described in Methods. 
In that case, the frequency of each common haplotype in a block (including the 
hidden common haplotypes) should be the same as that of its peer in the next 
block. The total frequency of uncommon haplotypes in each block is also the 
same across blocks. 
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Fig. 4. Extension of Figure 3 to allow uncommon haplotypes {Oij’s in each 
block) and to introduce hidden states in some blocks (words in boxes). 



For the data set of Daly et al.(2001), we use the same graphical structure 
and same recombination rates between blocks for both untransmitted and trans- 
mitted haplotypes, but we allow the haplotype frequencies to differ between 
untransmitted and transmitted haplotypes. Based on the fitted model, the total 
estimated frequency of common haplotypes is 93% in each block among trans- 
mitted chromosomes and 85% in each block among untransmitted chromosomes 
(these values are constrained by the model to be constant across blocks). The 
estimated recombination rates between adjacent blocks are 0.16, 0.38, 0.27, 0.13, 
0.20, 0.13, 0.09, 0, 0.28, and 0.35, respectively, from the first block to the last 
block. The recombination rate is not directly comparable to that in Daly et 
al.(2001) as our definition of historical recombination differs from theirs. 

To assess the goodness of fit for this haplotype block structure model, we 
simulated 500 data sets from model D (haplotype block structure model) with the 
parameter values set to the estimates from the original data. For each simulated 
data set, we fit model D and model G (the unconstrained Markov order 1 model 
on blocks) and got their likelihoods. The graph in Fig 5 shows that the likelihood 
ratio (model G to model D) for the original data is much higher than all the 
likelihood ratios from the simulated data. If the type I error of the test were 
correct, this would imply that our haplotype block structure model (Model D) 
would be rejected in favor of the unconstrained block Markov model (Model G) . 
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We assessed the type I error of our parametric bootstrap procedure as de- 
scribed in the Monte Carlo procedure for assessment of type I error sub-section 
of the Methods section, with mi set to 100 and m 2 set to 200. (Here we use the 
phased untransmitted haplotype data instead of the unphased genotype data 
for simplicity in computation and easier comparison of simulated data sets and 
original data sets.) We found that the null hypothesis was rejected in every case, 
i.e. the type I error was estimated to be close to 1 instead of .05. Thus, the para- 
metric bootstrap performed very poorly in this situation. Comparison of data 
sets simulated from model D with the original data set shows that the simulated 
data sets have much smaller numbers of distinct haplotypes in each block than 
does the original data set (results not shown). That is, rare haplotypes in the 
original data set tended to be “lost” in the simulations, just by chance, and new 
haplotypes tended not to be created, due to the combination of model choice 
and estimation of parameters by maximum likelihood. Possible solutions to this 
problem are discussed in the Discussion section. 




Fig. 5. Monte Carlo procedure to assess the goodness of fit of model D. Lines 
from top to bottom are log likelihood of model G for original data, log likelihood 
of model G for simulated data, log likelihood of model D for simulated data and 
log likelihood of model D for original data. 500 data sets are simulated. 



Discussion 

Our interest is in developing probability models of background LD for use in 
likelihood-based or Bayesian LD mapping methods. For the localization problem 
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in particular, it is important to find a model that can adequately capture the 
features of background LD (Strahs and McPeek (2003)). For any given data set, 
there will typically be multiple distinct models that fit the data. Therefore, one 
might expect that different investigators looking at the same set of data might 
reasonably come up with different models. This becomes even more of an issue as 
one considers different data sets for the same region from the same population, or 
the same region across different populations. From the point of view of mapping, 
it should not be necessary to come up with “the correct” model. A parsimonious 
model that fits the relevant control data should be adequate. 

Building on the ideas of Daly et al.(2001), we formulate a class of hidden 
Markov models for background LD. This class allows a fairly general graph 
structure of preferred and non-preferred transitions based on the haplotype block 
structure with corresponding well-defined transition probabilities. It allows for 
common haplotypes and uncommon haplotypes in each block. Also it captures 
the idea of ancestral haplotypes. Our work to date does not address the issue 
of e.g. how to choose blocks and common haplotypes within blocks though they 
are important issues. 

We considered a Monte Carlo (parametric bootstrap) procedure to assess the 
goodness of fit of models for block haplotype structure. This approach allows a 
wide latitude in choice of test statistic. (In choosing a test statistic, we have 
so far considered the likelihood ratio between two models or the likelihood of 
the proposed model itself.) We proposed an additional layer of Monte Carlo to 
assess the type I error of the parametric bootstrap procedure for assessment of 
goodness of fit. The results indicate that the parametric bootstrap performed 
poorly for our case in which we have multinomial-type models with many cells 
having very low probabilities and only a moderate sample size. 

This leads to an interesting question: for a multinomial distribution where 
there are a lot of cells with very low probabilities, how to simulate from this 
distribution without losing too many rare cells? A simple way of dealing with 
this is to add some value to each cell before calculating observed proportions, 
as in Bayesian estimation with a symmetric Dirichlet prior. However, this ap- 
proach also did not work well in our case (results not shown), because it is 
too non-specific. Simply adding some value to every cell roughly corresponds to 
an assumption that when a new haplotype arises, its type is uniformly chosen, 
whereas in reality, new haplotypes that arise tend to be similar to existing hap- 
lotypes. This suggests a more structured approach to smoothing the haplotype 
probabilities, which we are currently working on. We note that the Daly et al. 
model appears to have some structured smoothing built into it. Currently, we 
are working on a model that incorporates all the features of our model described 
here, with additional structured smoothing. 
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Abstract. There is currently great interest in the genetics community in 
identifying the genes that are involved in complex (multifactorial) diseases. 
One standard approach uses family -based “linkage” methods to identify regions 
of the genome that harbor susceptibility genes, and then population-based 
“linkage-disequilibrium” (LD) methods to search more narrowly for the culprit 
genes. Many of the important statistical issues involved in the first phase of this 
process, linkage mapping, have been addressed, but data analysis for LD- 
mapping remains extremely challenging. In this article we describe our work on 
developing a full statistical framework for LD mapping based on a population 
genetics model known as the coalescent. 



Introduction 

Complex diseases are diseases that result from the interaction of multiple factors, 
usually including both genetic and environmental factors (Risch 2000). While 
identifying the genes involved in such diseases is of considerable interest, this has 
proved extremely challenging — largely because the statistical signal due to any 
particular risk gene is typically quite small. Thus, it is clearly important to have 
statistical methods that can extract as much of the available signal from the data as 
possible. 

Broadly speaking, there are two main approaches to gene mapping (Risch 2000). 
The first, linkage mapping, aims to find genetic markers whose transmission within 
families is correlated with the transmission of the disease. The second approach, 
known as association or LD mapping, aims to find genetic variants that are more 
common in affected individuals than in unrelated healthy controls; such markers are 
likely to be close to the actual disease variants. LD mapping is also commonly used to 
localize the position of the disease variants; this will be the focus of the present 
article. On average, the strength of the association between genetic markers and 
phenotype status should be strongest close to the site of the actual disease variant(s), 
and weaker at more distant markers. This fairly basic intuition works well for simple, 
Mendelian traits (e.g., Kerem et al 1989, Hastbacka et al 1992), but is more 
challenging to apply to data for complex traits, where the overall signal is weaker 
(Horikawa et al 2000). 
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For many data sets, the statistieal aspeets of linkage analysis are now relatively 
routine, and there are standard, widely-used approaehes to both parametrie and 
nonparametrie analysis that make effieient use of the genotype data (Kruglyak et al 
1996). Among the remaining statistieal ehallenges in linkage analysis is the 
development of eomputational methods that are feasible for data sets with very large 
or eomplieated pedigree struetures (e.g., Abeeasis et al 2002, Abney et al 2002). 

The situation with LD mapping is far less straightforward, as there is still little 
eonsensus about even what the underlying statistieal model should be. Among the 
available methods for LD fine mapping, a few are relatively nonparametrie, for 
example, adapting standard regression teehniques to the estimation problem 
(Lazzeroni 1997). While sueh methods are relatively simple to implement, and to use, 
they may not be well ealibrated beeause the data have a eomplieated dependenee 
strueture aeross both markers and individuals, produeed by the evolutionary proeess 
(Morris et al 2002). An alternative approaeh involves explieitly modelling the 
evolutionary proeess that produees the data, with the hopes that this will extraet more 
information from the data, and provide better eoverage properties. A series of more 
reeent papers have taken this type of approaeh, performing Bayesian or likelihood 
inferenee within an explieit evolutionary framework (MePeek and Strahs 1999, 
Morris et al 2000, Liu et al 2001, Morris et al 2002). 

In this artiele we provide an informal deseription of work in this area presented by 
one of the authors (JKP) at the DIMACS eonferenee. A more eomplete deseription 
will eventually beeome available elsewhere. 



The Problem 

We start by assuming that we have one sample of individuals who have a disease of 
interest (eases), and a sample of healthy eontrols. A partieular genomie region is 
believed to eontain a gene involved in disease suseeptibility (e.g., beeause this region 
was implieated in a prior linkage study, or beeause an assoeiation signal has been 
deteeted in the region). Eaeh of the sampled individuals has been genotyped at a set 
of SNPs aeross the region of interest. We assume that we have haplotype data 
available, or else that haplotypes have been estimated by a suitable method (e.g., 
Stephens et al 2001). Our goal is to estimate the loeation of the disease variant(s) in 
order to refine the seareh. Thus, a suitable output is to provide a “best guess” for the 
loeation, as well as an assoeiated eredible region or eonfidenee interval. 

Note that the experimental setup, as deseribed above, is one version of a range of 
possible seenarios. The general framework presented here is suffieiently flexible to 
eneompass other sorts of data, sueh as family-based “TDT” samples (Spielman et al 
1993), or quantitative phenotypes (though the latter is not yet implemented). In 
prineiple, haplotypes eould also be inferred within our eomputational seheme, though 
this would add eonsiderable further eomplexity to a problem that is already 
eomputationally ehallenging. 

When thinking about the stmeture of the data, it is helpful to think about the 
population genetie proeess that produees it. For this purpose, the most sueeessful elass 
of models is deseribed by a stoehastie proeess known as the “eoaleseent”, as follows 
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(reviewed by Hudson 1991, Nordborg 2001). Consider a partieular nueleotide 
position along the sequenee of the region of interest. Then all the ehromosomes in our 
sample are related at that position by a single unknown genealogy (Figure 1). Any 
DNA sequenee differenees among the ehromosomes at that position arise as the result 
of mutations that oeeur along the branehes of the genealogy. The genealogies at 
nearby positions along the sequenee may not be identieal, beeause reeombination 
during the history of the sample ean rearrange the topology. Loosely speaking, the 
effeet of reeombination ean be visualized as breaking some of the branehes in the 
genealogy and reattaehing them elsewhere. If the genetie distanee between two 
positions is small, then the genealogies at those positions will usually be quite similar 
beeause we would expeet few historieal reeombination events between those two 
points; eonversely, two positions far apart may have quite different aneestral 
genealogies. 




Fig. 1. Schematic showing the ancestral genealogies for ten sampled chromosomes (numbers 
0 — 9) at different positions across a chromosomal region. Past recombination events in this 
region have the effect of rearranging the branches in one genealogy relative to its neighbors, so 
that the three genealogies are all somewhat different from each other. However, the genealogies 
are still correlated with each other — for example, chromosomes 6 and 7 share a recent common 
ancestor across the entire region. The four boxed chromosomes are all sampled from case 
individuals. In the center of the region, they all form a single clade on the genealogy defined by 
a shared ancestral mutation (star); at each end of the region, this clade has been partially broken 
apart by recombination 



Now suppose that there is a disease susceptibility mutation somewhere in the 
region of interest (as with the star in Figure 1). This mutation can be thought of as 
defining a clade of the tree in which all the chromosomes share a recent common 
ancestor, and inherit that mutation. This idea generalizes, so that if there is more than 
one mutation at that position, then each independent mutation labels a clade of the 
tree in which all the descendent chromosomes carry that mutation. Chromosomes 
from affected individuals will have an increased probability of carrying one of these 
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mutations, while control chromosomes will have a decreased probability of carrying a 
mutation. Hence, the presence of unobserved disease mutations leads to clades of the 
tree where case chromosomes are over-represented; i.e., the case chromosomes have a 
tendency to cluster non-randomly on the genealogy. The overall extent of clustering 
depends on the effect size (penetrance) of the mutations — that is, it will be strongest 
for a rare Mendelian recessive disorder, and much weaker for a complex disease. As a 
result of recombination, the extent of the clustering decays (on average) as one moves 
along the chromosome, away from the actual position of the disease mutation. In 
effect, this is the signal that LD mapping relies on (often implicitly) in the search for 
disease genes. 

The role of the genetic marker data in this problem is that they provide partial 
information about the unknown genealogy at each position across the sequence. 
Indeed, under some plausible assumptions, it can be shown that if one is told the 
actual genealogy everywhere across the sequence, then the marker data are no longer 
of interest, as they contain no further information (Zdllner and Pritchard, unpublished 
results). Thus, we have pursued an approach to this problem that is based explicitly 
on trying to extract as much information as possible about the underlying genealogy 
at each position across the sequence. The inference about location of the disease 
mutation then makes use of the inferred genealogies. 



A Local Approximation of the Ancestral Recombination Graph 

The full multilocus version of the coalescent can be represented using a concept 
known as the ancestral recombination graph (ARG) (Nordborg 2001). In recent years, 
several other groups have attempted to perfonn full likelihood inference for 
population genetic parameters under the ARG, using either Markov chain Monte 
Carlo (MCMC) or Importance Sampling (Griffiths and Marjoram 1996, Kuhner et al 
2000, Nielsen 2000, Feamhead and Donnelly 2001; reviewed by Stephens 2001). The 
experience of these earlier groups indicates that it is quite difficult to construct 
algorithms that perform reliably on data sets of moderate or large size. The difficulty 
stems from the property that the space of ARGs is rather complicated, and alternative 
configurations of the ARG that are supported by the data may be separated by deep 
troughs of low likelihood that hinder mixing of MCMC schemes. 

Based on these considerations, we decided to implement a computational scheme 
that has much the same goal, but is technically simpler. In our approximation, we 
focus on one position along the sequence, at a time (and repeat this analysis at many 
positions across the entire region of interest). Our goal then is to use the full marker 
data to estimate the genealogy at that position. We use a Markov chain Monte Carlo 
method to sample from the posterior distribution of ancestral genealogies at that 
position, given the data, and the coalescent prior. While this algorithm does not 
implement the full ancestral recombination graph, it does capture the essential feature 
that pairs of chromosomes with recent common ancestors tend to share large regions 
of chromosome that are identical, while chromosomes with more distant ancestors 
tend to share smaller segments, as the result of recombination. 

The next phase of the inference procedure is to compute the likelihood of the 
phenotype data given the genealogy at a position x along the sequence. Computing 
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this likelihood is complicated by two features: we do not know in advance where on 
the genealogy the mutations might have occurred, and we do not know in advance 
what the penetrance values are for the different genotypes. For conducting inference 
in a Bayesian framework, it is most natural to integrate over these unknown values. 

In summary, the end result of this procedure is that at many positions across the 
region of interest, we sample coalescent genealogies from the posterior density given 
the marker data; then we compute the likelihood of the phenotype data given each of 
those genealogies. Ultimately this produces an estimate of the likelihood of the 
phenotype data assuming a disease mutation at each of the test points across the 
region of interest. The likelihoods can then be used to obtain a posterior credible 
region for the location of the disease mutation(s), or for significance testing to decide 
whether the data support the presence of disease variation in the region. 



Results and Discussion 

Sample results from application of this algorithm to simulated data are shown in 
Figure 2. Data were simulated using a standard coalescent model. The simulated 
samples included 100 case individuals, and 100 controls, each genotyped at 110 SNPs 
across the region. There were 3 disease mutations (all at the same map position), 
none of which was included in the marker set of SNPs. The three genotype 
penetrances were 0.1 (wildtype homozygote), 0.2 (heterozygote) and 0.4 (mutation 
homozygote). Thus, this represents a moderately difficult problem, perhaps typical of 
a complex disease data set. 

In this example, our posterior mode estimate for the location of the disease 
mutation is close to the true value (Figure 2). The credible region, however, is 
reasonably wide, encompassing on the order of lOOkb (assuming a uniform prior). 
Under the null hypothesis, the likelihood of these data is ~10'^°, substantially lower 
than observed here, indicating that we see significant evidence for a genetic effect in 
this region. More extensive tests of our method, on both simulated and real data are 
underway, and will be presented elsewhere. 

In summary, we have described our work on developing coalescent-based methods 
for LD fine mapping and association mapping. This work extends earlier methods 
(especially that of Morris et al 2002) in several respects. Most notably, we include 
both cases and controls within a single coalescent scheme, and we reframe the 
inference problem in terms of detecting excess clustering of case chromosomes in the 
genealogy. In contrast, the earlier methods generally assume a rather simple Markov 
model for LD in the controls. This simplifies the implementation somewhat, but is not 
ideal in other respects. First, by failing to model the LD in controls accurately, it is 
possible that the answers might be misleading when the Markov LD assumption is 
badly wrong. Second, modelling the cases and controls within a single framework 
makes it possible to consider much more general genetic models, including a full 
range of penetrance models. It is also conceptually straightforward to analyze 
quantitative traits within this framework (though we have not yet implemented this). 
Third, the previous methods were unable to perform significance testing, which we 
can do by asking whether the clustering of case chromosomes on the genealogy is 
more than random. 
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distance to disease mutation 



Fig. 2. Plot of the estimated likelihood of simulated genotype data as a function of the assumed 
position of unobserved disease variation. The data include 200 case and 200 control 
chromosomes, each genotyped at 110 SNPs across a 250KB region. Three disease mutations 
with modest effects on disease risk occur at the position indicated by the vertical black line. 
Further details are provided in the text 



A number of challenges remain in this type of problem. One thorny issue is that we 
assume that haplotype phase is known, while in practice it often is not known. In 
principle, one could include haplotype updates within the MCMC scheme, and infer 
the haplotypes simultaneously with everything else. However, it seems that this would 
add considerable complexity to an MCMC scheme that is already rather complicated, 
and at present we estimate haplotypes prior to analysis. Furthermore, these types of 
MCMC methods tend to be time-consuming to mn, and require close examination to 
check convergence and mixing, and to ensure that the results are reliable. 
Nonetheless, we are optimistic that such multipoint methods for LD mapping can 
soon have a substantial positive impact on the field of complex trait mapping. 
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Abstract. We consider the following problem: Given n genotypes, does 
there exist a set H of haplotypes such that each genotype is generated by 
a pair from this set, and this det can be derived on a perfect phylogeny. 
Recently, Gusfield, 2002, presented a polynomial time algorithm to solve 
this problem that uses established results from matroid and graph theory. 
In this work, we present an O(nm^) algorithm for this problem using 
elementary techniques. We also describe a linear space representation 
for representing all possible solutions, and provide a formula for counting 
the number of possible solutions. 
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Exhaustive Enumeration and Bayesian Phase 

Inference 
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Abstract. A strategy for haplotype inference will be described using 
graph-theoretic approaches to enumeration of admissible haplotype 
phases, coupled with Bayesian statistical inference for assessing likeli- 
hoods of alternative phasings. An application to the special problem of 
inference of phase from trisomic data (Down syndrome) will be presented. 
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How Does Choice of Polymorphism Influence 
Estimation of LD and Mapping? 
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Abstract. Fine-mapping and positional cloning studies may focus on 
only a subset of polymorphisms identified within a region. In part this 
may reflect the trade-off between information and economics. For ex- 
ample, for many of the genetic analytic approaches commonly used in 
Hne-mapping and positional cloning, there is little value in typing poly- 
morphisms that are in near-perfect linkage disequilibrium (LD) with each 
other. Consequently, investigators may focus initially on only those poly- 
morphisms that have unique patterns in some small subset of individuals 
used in screening studies. While this strategy is economical and scientifi- 
cally defensible for many common analytic approaches, it precludes unbi- 
ased estimation of the extent of LD in a region. Moreover, consequences 
for more sophisticated multipoint LD mapping approaches considering 
extended haplotypes are difficult to predict. We report here on studies of 
the NIDDMl region of 2qter and a region on chromosome 15 including 
CYP19 in which we have made a systematic examination of the effects of 
this type of polymorphism choice. While our conclusions are clearly lim- 
ited because we have studied only these two regions, our results suggest 
that multipoint linkage disequilibrium mapping methods can be sensitive 
to choice of polymorphism. In contrast, studies on the extent of LD in 
a region may be less sensitive to this type of polymorphism choice, as 
long as all common polymorphisms with unique patterns are included in 
analyses. 
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Abstract. The application of statistical methods to infer and recon- 
struct linkage phase in samples of diploid sequences is a potentially 
time- and labor-saving method. The Stephens-Smith-Donnelly (SSD) al- 
gorithm is one such method, which incorporates concepts from popula- 
tion genetics theory in a Markov chain-Monte Carlo technique. We ap- 
plied a modified SSD method, as well as the expectation-maximization 
and partition-ligation algorithms, to sequence data from eight loci span- 
ning >1.5 Mb on the human X chromosome. We demonstrate that the 
accuracy of the modified SSD method is better than that of the other 
algorithms and is superior in terms of the number of sites that may 
be processed. Also, we find phase reconstructions by the modified SSD 
method to be highly accurate over regions with high linkage disequi- 
librium (LD). If only polymorphisms with aminor allele frequency 10.2 
are analyzed and scored according to the fraction of neighbor relations 
correctly called, reconstructions are 95.2% accurate over entire 100-kb 
stretches and are 98.6% accurate within blocks of high LD. 
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of Haplotypes 
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Abstract. The talk will describe the general issues in Bayesian ap- 
proaches to haplotype reconstruction, including the distinction between 
choice of prior distribution, and the computational methods used to 
approximate posterior distributions. We describe recent improvements 
to the software PHASE, including handling missing data and improved 
computational methods, and compare its behaviour to some other pub- 
lished methods. 
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Abstract. We have developed several distinct combinatorial approaches 
to the haplotype inference problem. I will talk about a few of the most re- 
cent of these approaches. One approach, the “pure parsimony” approach 
is to find N pairs of haplotypes, one for each genotype, that explain the N 
genotypes and MINIMIZE the number of distinct haplotypes used. Solv- 
ing this problem is NP-hard, however, for reasonable size data (larger 
than in general use today), the “pure-parsimony” solution can be effi- 
ciently found in practice. I will also talk about an approach that mixes 
pure-parsimony with Clark’s subtraction method for haplotyping. Simu- 
lations show that the efficiently of both methods depends positively on 
the level of recombination - the more recombination, the more efficiency, 
but the accuracy depends inversely on the level of recombination. I will 
also discuss a practical ways to greatly boost the accuracy of Clark’s sub- 
traction method, and identify haplotype pairs with high confidence. This 
approach has been tested on molecularly determined data, which will be 
published along with the method. Comparisons are made with PHASE 
and HAPLOTYPER. I will also mention some recent developments in 
haplotype inference that are based on viewing the problem in the context 
of the perfect phylogeny problem. This builds on a near-linear-time algo- 
rithm to determine whether genotype (unphased) SNP data is consistent 
with the no-recombination, infinite sites coalescent model of haplotype 
evolution. Stated differently, whether there are haplotype pairs for the 
genotypes, which satisfy the 4-gamete condition for tree-form evolution. 
The algorithm finds in linear time an implicit representation of the set of 
all solutions to the problem. A detailed treatment of a simple alternative 
algorithm for that problem will be given in the talk by V. Bafna. 
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Abstract. Critical to the understanding of the genetic basis for com- 
plex diseases is the modeling of human variation. Most of this varia- 
tion can be characterized by single nucleotide polymorphisms (SNPs) 
which are mutations at a single nucleotide position. To characterize an 
individual’s variation, we must determine an individual’s haplotype or 
which nucleotide base occurs at each position of these common SNPs for 
each chromosome. In this paper, we present results for a highly accu- 
rate method for haplotype resolution from genotype data. Our method 
leverages a new insight into the underlying structure of haplotypes which 
shows that SNPs are organized in highly correlated “blocks”. The ma- 
jority of individuals have one of about four common haplotypes in each 
block. Our method partitions the SNPs into blocks and for each block, 
we predict the common haplotypes and each individual’s haplotype. We 
evaluate our method over biological data. Our method predicts the com- 
mon haplotypes perfectly and has a very low error rate (0.47%) when 
taking into account the predictions for the uncommon haplotypes. 
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Abstract. Haplotypes have become increasingly popular because of the 
abundance of single nucleotide polymorphisms (SNPs) and the limited 
power of the single-locus analyses. To contend with some weaknesses of 
the existing haplotype inference methods, we propose new algorithms 
based on the partition-ligation idea. In particular, we first partition the 
whole haplotype into smaller segments. Then, we use either the Gibbs 
sampler or the EM algorithm to construct the partial haplotypes of each 
segment and to assemble all the segments together. Our algorithm can 
infer haplotype frequencies rapidly and accurately for a large number of 
linked SNPs and provides a robust estimate of their standard deviations. 
The algorithms are robust to the violation of Hardy- Weinberg equilib- 
rium and can handle missing marker data easily. As a follow-up study, we 
also investigated two related questions: how much the haplotype infor- 
mation contributes to linkage disequilibrium (LD) mapping and whether 
an in silico haplotype construction preceding the LD analysis can help. 
For simple disease gene mapping our conclusions are as follows: (a) if a 
proper statistical model is employed, the loss of haplotype information 
for either control or disease data do not have a great impact on LD hne 
mapping, and (b) haplotype inference should be carried out jointly with 
LD analysis to achieve the most accurate location estimation. 
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Abstract. The hope behind association mapping is to use linkage dis- 
equilibrium (LD) as an indicator of proximity of a marker to a suscep- 
tibility locus. This follows from the expectation that marker-phenotype 
association is proportional to linkage disequilibrium, which is inversely 
related to recombination. If there are more than two alleles at a locus af- 
fecting risk, the association statistic is instead a weighted sum of linkage 
disequilibria and genotypic susceptibilities. There is no longer a simple 
relationship, even in expectation, with recombination. These results ex- 
tend to marker haplotypes. In addition to the pairwise association terms 
of the single marker tests, marker haplotype associations depend on the 
weighted sum of multi-locus disequilibria and genotypic susceptibilities. 
Several tests of haplotype association are presented here, along with a 
comparison of these tests within different LD contexts. 
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Abstract. Polymorphism data from 20 partially resequenced copies of 
human chromosome 21 — more than 20,000 polymorphic sites — are 
analyzed. The allele-frequency distribution shows no deviation from the 
simplest population genetic model with a constant population size (al- 
though we show that our analysis has no power to detect population 
growth). The average rate of recombination per site is estimated to be 
roughly one half of the rate of mutation per site, again in agreement 
with simple model predictions. However, sliding-window analyses of the 
amount of polymorphism and the extent of linkage disequilibrium (LD) 
shows signihcant deviations from standard models. This could be due to 
the history of selection or demographic change, but it is impossible to 
draw strong conclusions without much better knowledge of variation in 
the relationship between genetic and physical distance along the chro- 
mosome. 
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Abstract. We describe a novel coalescent-based method for estimating 
the location of a disease susceptibility locus. This is designed for the sit- 
uation where we have genotype data from a sample of cases and controls, 
in a region that is believed to contain a disease mutation. For a given 
position on the marker-map, we use the marker information of both cases 
and controls to reconstruct local approximations of the ancestral recom- 
bination graph using Markov Chain Monte Carlo. From this, we can 
compute the likelihood of the phenotype data assuming a susceptibility 
gene at this position; the procedure is repeated at a series of locations 
across the region to estimate the posterior density. 
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Abstract. Recent studies of linkage disequilibrium (LD) have suggested 
that (1) recombination rates vary tremendously across the genome, such 
that large scale estimates of the recombination rate based on a compar- 
ison of physical and genetic maps may not be informative about local 
patterns of LD (2) models of recombination that include gene conversion 
as well as crossing-over better predict levels of LD at short scales (3) 
levels of LD are lower in samples from sub-Saharan African populations 
than in other population samples. These observations have important 
implications for linkage disequilibrium based association-studies; they 
suggest that large areas of the genome can be tagged with few mark- 
ers, and that genome-wide studies and fine scale mapping efforts might 
best be conducted in different populations. To examine the generality 
of these observations, we analyzed over 80 data sets sequenced in 24 
African-Americans and 23 individuals of European descent (data from 
http://pga.mbt.washington.edu/). As an index of LD, we estimated the 
population rate of crossing-over q in the two ?population? samples. We 
compared estimates of p (with and without gene conversion) to those ob- 
tained from a comparison of genetic and physical maps. To gain a sense 
of recombination rate variation at a small scale, we also considered how 
much estimates of p vary along sequences in actual data compared to 
simulated data. 
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Abstract. A non-random association of disease and marker alleles on 
chromosomes in populations can arise as a consequence of historical 
forces such as mutation, selection and genetic drift, and is referred to 
as “linkage disequilibrium” (LD). LD can be used to estimate the map 
position of a disease mutation relative to a set of linked markers, as well 
as to estimate other parameters of interest, such as mutation age. Para- 
metric methods for estimating the location of a disease mutation using 
marker linkage disequilibrium in a sample of normal and affected indi- 
viduals require a detailed knowledge of population demography, and in 
particular require users to specify the postulated age of a mutation and 
past population growth rates. A new Bayesian method is presented for 
jointly estimating the position of a disease mutation and its age. The 
method is illustrated using haplotype data for the cystic fibrosis Delta 
F508 mutation in europe and the DTD mutation in Finland. It is shown 
that, for these datasets, the posterior probability distribution of disease 
mutation location is insensitive to the population growth rate when the 
model is averaged over possible mutation ages (using a prior for age based 
on the population frequency of the disease mutation). Fewer assumptions 
are therefore needed for parametric LD mapping. 
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Abstract. Association studies, both family-based and population-based, 
can be powerful means of detecting disease-liability alleles. To increase 
the information of the test, various researchers have proposed targeting 
haplotypes. The larger number of haplotypes, however, relative to alleles 
at individual loci, could decrease power because of the additional de- 
grees of freedom required for the test. An optimal strategy would focus 
the test on particular haplotypes or groups of haplotypes, much as is 
done with cladistic-based association analysis. First suggested by Tem- 
pleton and colleagues, such analyses use the evolutionary relationships 
among haplotypes to produce a limited set of hypothesis tests and to 
increase the interpretability of these tests. To more fully utilize the in- 
formation contained in the evolutionary relationships among haplotypes 
and in the sample, we propose generalized linear models (GLM) for the 
analysis of data from family-based and population-based studies. These 
models fully account for haplotype phase ambiguity and allow for co- 
variates. The models are encoded into a software package, EHAP (for 
Evolutionary-based Haplotype Analysis Package), which also provides 
for various kinds of exploratory data analysis. The exploratory analyses, 
such as error checking, estimation of haplotype frequencies, and tools for 
building cladograms, should facilitate the implementation of cladistic- 
based association analysis with haplotypes. 
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Abstract. The determination of complete human genome sequences, 
subsequent work on mapping human genetic variations, and advances in 
laboratory technology for screening for these variations in large popula- 
tions are together generating tremendous interest in genetic association 
studies as a means for characterizing the genetic basis of common hu- 
man diseases. Considerable recent work has focused on using haplotypes 
to reduce redundancy in the datasets, improving our ability to detect 
significant correlations between genotype and phenotype while simulta- 
neously reducing the cost of performing assays. A key step in applying 
haplotypes to human association studies is determining regions of the 
human genome that have been inherited intact by large portions of the 
human population from ancient ancestors. This talk describes computa- 
tional methods for the problem of predicting segments of shared ancestry 
within a genetic region among a set of individuals. Our approach is based 
on what we call the haplotype coloring problem: coloring segments of a 
set of sequences such that like colors indicate likely descent from a com- 
mon ancestor. I will present two methods for this problem. The first 
uses the notion of ?haplotype blocks” to develop a two-stage coloring 
algorithm. The second is based on a block-free probabilistic model of 
sequence generation that can be optimized to yield a likely coloring. I 
will describe both methods and illustrate their performance using real 
and contrived data sets. 
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Abstract. The problem of testing for significant differences in haplo- 
type frequencies between a random sample of individuals (randoms) and 
a sample of individuals with a genetic disease (cases) is considered. The 
questions are (1) What is the statistical power in testing for differences in 
haplotype frequencies? and (2) How much statistical power is lost when 
haplotype phase cannot be resolved but instead must be inferred using 
a maximum likelihood method? A likelihood ratio test of differences in 
haplotype frequencies in randoms and cases is used, and the theory is 
developed in terms of the non-centrality parameter of the non-central 
chi-square distribution. If a causative allele has a multiplicative effect on 
penetrance, and if sample sizes are large and the effect of the causative 
allele is small, analytic expressions for the non-centrality parameter are 
obtained when haplotypes can and cannot be resolved. The loss in power 
is independent of the frequency of the causative allele and can in general 
be compensated for by increasing the sample sizes by a factor of two or 
less. For a dominant causative allele, numerical results are obtained that 
show the important features of the results for multiplicative penetrance 
still are valid. We conclude that maximum likelihood inference of hap- 
lotype frequencies and a likelihood ratio test of differences in inferred 
frequencies can be useful in a case-control setting. 
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Abstract. Current methods for understanding the relationship between 
LD and the underlying recombination rate are limited. The most common 
approach is to compute a measure of LD between every pair of sites 
in the region, and to form a graphical display of the results. However, 
it is typically difficult to assess the significance of observed patterns. 
More sophisticated coalescent-based statistical methods for estimating 
local recombination rate from patterns of LD are either computationally 
impractical for moderate-sized regions, or suffer from loss of information 
by using only a summary of the data. Furthermore, they all assume 
constant recombination rate, making them poor tools for studying local 
recombination rates. Here we propose a novel computationally-tractable 
model for LD across multiple loci. We apply this model to the problem 
of inferring recombination rates from population data, and in particular 
to identifying variation in the local recombination rate (’’hotspots” and 
’’coldspots”) long chromosomes. We outline how this model might be 
used to develop more powerful methods for LD mapping. 
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Abstract. We develop a dynamic programming algorithm for haplo- 
type block partitioning to minimize the number of representative single 
nucleotide polymorphisms (SNPs) required to account for most of the 
haplotype quality in each block. The block quality is a function of the 
haplotypes dehned by the SNPs in the block. Any measure of haplotype 
quality can be used in the algorithm and of course the measure should 
depend on the specific application. The dynamic programming algorithm 
is applied to analyze the haplotype data on chromosome 21 of Patil et 
al. Using the same criteria as in Patil et al. (6), we identify a total of 
3,582 representative SNPs and 2,575 blocks which are 21.5% and 37.7%, 
respectively, smaller than those identified using a greedy algorithm of 
Patil et al. We also compare the power of association studies using all 
SNPs, tag SNPs and the same number of randomly chosen SNPs. 
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Abstract. The haplotype structure of a population derives from the re- 
combination events in meioses that are ancestral to current population 
members. In very small populations of conservation importance, the ex- 
tent of genome survival depends also on recombination which distributes 
founder genome across the population. Relative to a founder population, 
a junction is a recombination point between genomes of different founder 
origins. In comparing two extant genomes, the segments shared IBD will 
be bounded by external junctions and may include internal junctions 
shared by both genomes. Thus IBD tracts are made up of a random 
number of segments bounded by junctions. A study of the process of 
internal and external junction types along pairs of chromosomes sam- 
pled from a population leads to new results on the variance of lengths 
of genome shared between relatives. This research is based on work with 
Dr. N. Chapman. 
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Abstract. With the aim of developing a linkage disequilibrium (LD) 
SNP map to serve as a resource for candidate-gene, candidate-region and 
whole-genome association studies, we have genotyped >250,000 SNPs 
on 90 DNA samples (45 African-American, 45 Caucasian, unrelated) se- 
lected from the Coriell Human variation collection. The individual geno- 
types thus generated have enabled us to survey the patterns of LD and 
haplotype diversity across all gene regions of the human genome. Here I 
describe the empirical results of the first comparative study of the pat- 
terns of LD across three entire human autosomes: Chromosomes 6, 21, 
and 22. We selected for the study a total of 17,966 SNPs covering more 
than 209 Mb of chromosomal segments, and overlapping 2,266 predicted 
gene regions, with a minor allele frequency greater than 10% in either 
population, and that were in Hardy- Weinberg equilibrium (p>0.01). Sev- 
eral methods to define ?haplotype blocks? were applied to this dataset, 
including several forms of the D? method and the 4-gamete rule. Hap- 
lotypes were then computationally inferred for the markers within each 
block by the EM algorithm to assess haplotype diversity. In addition, a 
subset of 277 SNPs spanning 4 Mb across the HLA region on chromosome 
6 was genotyped on 550 DNA samples of unrelated individuals of Eu- 
ropean ancestry from north Germany, 93 samples from Norway, and 77 
samples from UK. We analyze the robustness of the different haplotype 
block definitions, the differences between the population samples, and the 
effect of sample size on the generalization of haplotype blocks defined in 
one given population sample. Finally, I present the preliminary results 
of haplotype-based power calculations for case-control studies across the 
gene regions of these three chromosomes. 
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Abstract. We have developed a software analysis package, HapScope, 
which includes a comprehensive analysis pipeline and a sophisticated 
visualization tool for analyzing functionally annotated haplotypes. The 
HapScope analysis pipeline supports: a) computational haplotype con- 
struction with an EM or Bayesian statistical algorithm; b) SNP classifica- 
tion by protein coding change, homology to model organisms or putative 
regulatory regions; c) minimum SNP subset selection by either a Brute 
Force Algorithm or a Greedy Partition Algorithm. The HapScope viewer 
displays genomic structure with haplotype information in an integrated 
environment, providing eight alternative views for assessing genetic and 
functional correlation. It has a user-friendly interface for: a) haplotype 
block visualization; b) SNP subset selection; c) haplotype consolidation 
with subset SNP markers; d) incorporation of both experimentally de- 
termined haplotypes and computational results; e) data export for ad- 
ditional analysis. Comparison of haplotypes constructed by the statisti- 
cal algorithms with those determined experimentally shows variation in 
haplotype prediction accuracies in genomic regions with different levels 
of nucleotide diversity. We have applied HapScope in analyzing haplo- 
types for candidate genes and genomic regions with extensive SNP and 
genotype data. We envision that the systematic approach of integrating 
functional genomic analysis with population haplotypes, supported by 
HapScope, will greatly facilitate current genetic disease research. 
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Assessment of Goodness of Fit of Models for 
Block Haplotype Structure 
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Abstract. Our aim is to formalize models for high-resolution haplotype 
structure in such a way that they can be useful in statistical methods 
for LD mapping. Some steps in that direction have been taken by Da- 
ley et al. ’01, who outline a hidden Markov model (HMM) that allows 
for common haplotypes in each block. We propose somewhat different 
models that also use HMM. In this talk, we address the problem of as- 
sessing goodness of fit of particular models, where each model involves 
choices such as number and positions of blocks and common haplotypes 
in each block. Our models also allow for haplotypes in a block that are 
not one of the common types. We discuss choice of goodness-of-ht statis- 
tic, parametrization, and computational issues involved in assessing the 
ht of background LD models to data. 
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